ABSTRACT

The theory for finding optimal policies for Markov processes with
transition rewards and many alternatives in each state, when the
transition probabilities are given, has already been developed. But in
most practical applications, these transition probabilities are not known
exactly; one has only some prior knowledge about them.

This problem of uncertain transition probabilities was first treated
by Dr. E. A. Silver in the Interim Technical Report #1 of the Operations
Research Center of M.I.T. Section I of the present report extends several
of Silver's results. We propose a dynamic programming formulation for the
problem of choosing an optimal operating strategy and we carry out the
solution for a special two-state example. However, it is found that
solution of non-trivial problems of any higher dimension is impractical.


Section II is concerned with experimental and heuristic approaches to
the problem, and relies upon simulation rather than upon analysis. We
investigate certain statistics of the process when the unknown transition
probabilities are governed by a multidimensional beta prior (a convenient
form for Bayes modification). We find that the process with known
probabilities which are equal to the mean values of the unknown
probabilities provides us with a remarkably good picture of the unknown
process. The hypothesis is stated and investigated that this process of
expected values is adequate for decision purposes, and that determining
decisions from it is feasible as well as useful.

Finally, some alternative approaches are suggested for cases which
might not be handled by the expected values technique.

CHAPTER I
INTRODUCTION

1.1 Statement of problem

The  process  which  we  are  considering  is  a  multistate  discrete  time 
Markov  process  with  transition  rewards.  The  decision  structure  consists 
of  a  set  of  possible  alternative  actions  in  each  state.  Each  such  action  has 
associated  with  it  a  unique  set  of  transition  probabilities  and  rewards.  At 
each  time  period,  an  alternative  must  be  specified  for  the  state  currently 
occupied.  A  set  of  alternatives,  one  for  each  state,  is  called  a  policy. 

We  might  illustrate  the  structure  of  the  problem  by  setting  forth  a 
simple  example  that  will  be  familiar  to  those  who  have  read  reference  (1). 

A taxi driver operates between three towns: A, B, and C. The probability
of his picking up a fare to a particular destination is dependent upon where
he is now, so we set up a Markovian model. But in addition we assume that
he can follow one of three different courses of action:

1.  He  may  cruise  around  and  wait  to  be  hailed. 

2.  He  may  wait  at  a  cab  stand  for  a  fare  to  come  along. 

3.  He  may  wait  at  a  radio  call  box  for  a  call  to  come  through. 

Of course, his choice of alternative will alter his probabilities of
picking up fares to the various towns; for instance, it may be more likely
that a radio call will be for a long trip. Also, the rewards will be
influenced by the alternatives because of the differing costs involved: it
is cheaper to wait at a cab stand than to cruise (although it may take
longer to get a fare).

Let us assume that all three alternatives are open to our driver, except
in town B where there is no radio call box, so that alternative 3 is not
possible. We might find the following probabilities and rewards to prevail:


Probability Matrix

          Alternative    Town A    Town B    Town C

Town A         1          1/2       1/4       1/4
               2          1/16      3/4       3/16
               3          1/4       1/8       5/8

Town B         1          1/2       0         1/2
               2          1/16      7/8       1/16

Town C         1          1/4       1/4       1/2
               2          1/8       3/4       1/8
               3          3/4       1/16      3/16

Reward Matrix

          Alternative    Town A    Town B    Town C

Town A         1          10        4         8
               2          8         2         4
               3          4         6         4

Town B         1          14        0         18
               2          8         16        8

Town C         1          10        2         8
               2          6         4         2
               3          4         0         6

This then is the structure of the problem we shall consider. When all
transition probabilities and rewards are known, it is possible to find the
optimal policy for an operation which lasts for a finite time with terminal
rewards, or one which lasts indefinitely long. Also, in this infinite
duration case, we might maximize the expected gain per period, or,
alternatively, we might prefer to discount the future rewards and maximize
the present value of the infinitely long reward stream. Both of these cases
have been solved by Howard's policy iteration method (1).
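
As an illustration of the known-probability case, the following is a minimal
sketch of policy iteration with discounting, applied to the taxi tables
above. It is not taken from reference (1): the names `TAXI`, `solve`, and
`policy_iteration`, the discount factor of 0.9, and the use of exact
fractions are all assumptions made for this example.

```python
from fractions import Fraction as F

# model[state][alternative] = (transition probabilities, transition rewards),
# copied from the taxi tables above (towns A, B, C -> states 0, 1, 2).
TAXI = [
    [([F(1, 2), F(1, 4), F(1, 4)], [10, 4, 8]),
     ([F(1, 16), F(3, 4), F(3, 16)], [8, 2, 4]),
     ([F(1, 4), F(1, 8), F(5, 8)], [4, 6, 4])],
    [([F(1, 2), F(0), F(1, 2)], [14, 0, 18]),
     ([F(1, 16), F(7, 8), F(1, 16)], [8, 16, 8])],
    [([F(1, 4), F(1, 4), F(1, 2)], [10, 2, 8]),
     ([F(1, 8), F(3, 4), F(1, 8)], [6, 4, 2]),
     ([F(3, 4), F(1, 16), F(3, 16)], [4, 0, 6])],
]


def solve(a, b):
    """Solve the small dense linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [x - factor * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def policy_iteration(model, beta):
    """Return (policy, values) maximizing expected discounted reward."""
    n = len(model)
    policy = [0] * n
    while True:
        # Value determination: solve (I - beta P) v = q for the current policy.
        rows = [model[i][policy[i]] for i in range(n)]
        q = [sum(p * r for p, r in zip(probs, rews)) for probs, rews in rows]
        a = [[(1 if i == j else 0) - beta * rows[i][0][j] for j in range(n)]
             for i in range(n)]
        v = solve(a, q)

        # Policy improvement: in each state pick the alternative that
        # maximizes immediate reward plus discounted value of the next state.
        def test_quantity(i, k):
            probs, rews = model[i][k]
            return sum(p * (r + beta * vj) for p, r, vj in zip(probs, rews, v))

        improved = [max(range(len(model[i])), key=lambda k: test_quantity(i, k))
                    for i in range(n)]
        if improved == policy:
            return policy, v
        policy = improved


policy, values = policy_iteration(TAXI, F(9, 10))
print([k + 1 for k in policy], [round(float(x), 2) for x in values])
```

With these tables and a discount factor of 0.9, the sketch settles on the
second alternative in every town.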

The work reported in this technical note concerns the Markov decision
problem in which the decision maker has imperfect knowledge of the
transition probabilities for the process. We assume that we have some prior
knowledge about these unknown transition probabilities, and that this
information is expressed in a probabilistic manner. We can gain information
about these unknowns by observing transitions as the process continues.

This problem becomes trivial in the infinite duration, no discounting
case. The optimal solution to the decision problem is to experiment
indefinitely long with every alternative to find the exact values of the
unknown transition probabilities, then solve the deterministic decision
problem by Howard's policy iteration to find a policy to use for an infinite
amount of time. This follows because even an infinitesimal improvement in
the expected gain per period is worth an infinite amount in this case.

However, when time has value, this becomes quite a different (and
difficult) problem. We must do some experimentation to find the best
policy, but the longer we experiment, the less that best policy becomes
worth to us. We would expect that for any discount factor, there would
exist a best strategy in an expected value sense. Such a strategy would
specify the best choice of alternative in each state for each distinct set
of prior knowledge. Since prior knowledge is updated in time as the
process operates (giving us information), the best choice of alternative
changes in time as well. It is important to see that the optimal strategy,
since it is optimal in an expected value sense, would not necessarily
ever lead us to follow that policy which is in fact optimal (that is, the
policy we would follow if we knew all the transition probabilities exactly).
The optimal strategy will, rather, give the best trade-off between search
and immediate earnings.

In this report we will consider a variety of approaches to this
problem. Heuristic approaches useful for large problems and analytic
methods suitable for small problems will be examined.




1.2 Notation


The discrete time Markov process with N states and with only one
alternative in each state is specified by a transition probability matrix
P whose element $p_{ij}$ gives the conditional probability of a transition
from i to j given that the process is in state i.

The reward structure is specified by a reward matrix R whose element
$r_{ij}$ gives the reward earned by the process for making a transition
from state i to state j.

The steady state probabilities for this process are denoted by a vector
$\pi = (\pi_1, \pi_2, \ldots, \pi_N)$ (we assume completely ergodic
processes).

The expected reward on the next transition, given that we are in
state i, is given by:

$$ q_i = \sum_j p_{ij} r_{ij} $$

Finally, the steady state gain, which is expected reward per transition,
is:

$$ g = \sum_i \pi_i q_i $$
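
The quantities just defined can be computed directly for any fixed policy.
The sketch below does so for the taxi example with alternative 2 used in
every town; it takes the gain to be the steady-state-weighted sum of the
expected immediate rewards, as described above. The function names and the
use of repeated multiplication to approximate the steady state are
assumptions made for this illustration.

```python
def expected_rewards(P, R):
    """q_i = sum_j p_ij * r_ij for each state i."""
    return [sum(p * r for p, r in zip(p_row, r_row)) for p_row, r_row in zip(P, R)]


def steady_state(P, sweeps=2000):
    """Approximate the limiting state probabilities by repeated multiplication.

    Assumes a completely ergodic chain, so pi <- pi P converges from any start.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(sweeps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def gain(P, R):
    """g = sum_i pi_i * q_i, the expected reward per transition."""
    return sum(w * q for w, q in zip(steady_state(P), expected_rewards(P, R)))


# Taxi example with alternative 2 used in every town.
P = [[1/16, 3/4, 3/16], [1/16, 7/8, 1/16], [1/8, 3/4, 1/8]]
R = [[8, 2, 4], [8, 16, 8], [6, 4, 2]]
print(expected_rewards(P, R), gain(P, R))
```

For this policy the expected immediate rewards are 2.75, 15, and 4, and the
gain works out to 1588/119, or about 13.34 per transition.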


If discounting is used, so that the present value of one dollar to be
received one period in the future is $\beta$, then the expected present
discounted value of the infinite stream generated by the one policy process
starting in state i is denoted by $v_i$. Clearly $v_i$ depends on the
starting state since rewards in the near future are the most important. The
column vector, v, of these values $v_i$ is found from the P matrix and the
q vector:

$$ v = [I - \beta P]^{-1} q $$


Now, let there be $k_i$ different alternatives in state i, for
i = 1, 2, ..., N. There are then $k_1 \cdot k_2 \cdots k_N$ possible
policies under which the process may operate. For each policy there is a
P matrix, a $\pi$ vector, and a v vector. We denote alternatives with a
lower case superscript, so that $p^k_{ij}$ is the conditional probability
of a transition to state j given that we are in state i and operating under
alternative k. We use upper case superscripts to denote policies, so
$\pi^A$ is the $\pi$ vector for policy A.
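
To make the count of policies concrete, the short sketch below enumerates
them for the taxi example, where the towns have 3, 2, and 3 alternatives.
The variable names are assumptions made for this illustration.

```python
from itertools import product
from math import prod

alternatives = [3, 2, 3]  # k_i for towns A, B, C in the taxi example

# The number of policies is the product of the k_i.
n_policies = prod(alternatives)

# Each policy names one alternative (numbered from 1) per state.
policies = list(product(*(range(1, k + 1) for k in alternatives)))

print(n_policies)      # 18
print(policies[:3])    # [(1, 1, 1), (1, 1, 2), (1, 1, 3)]
```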

Finally, the entire P matrix for the problem has $\sum_{i=1}^{N} k_i$
rows, and each entry is an unknown about which we have some prior knowledge.
Because we will soon need some precise terminology to distinguish just what
process we are referring to, we decide to call this process with unknown
transition probabilities the primary process, since it is the primary focus
of our concern. We call the entries random variables, over which we have
some prior distribution. This is the Bayesian formulation of unknowns.


Because the statistics associated with the primary process are random
variables, we write them with a tilde above them. Thus, $\tilde p^k_{ij}$
is the unknown probability of going to j under alternative k given that we
are in state i. These are random variables and each one has some
distribution; we shall select such a distribution from a known parametric
family. Thus with each random variable is associated one or more prior
parameters. These are just numbers, and for reasons which will become clear
in the next chapter, we designate the prior parameters $m^k_{ij}$.


Later it will be convenient to speak of another process, whose transition
probabilities are known exactly and are equal to the mean value of the
corresponding probabilities for the primary process. We will call this
process the expected process. Because we shall refer to it so often,
statistics associated with this process will be written without extra
tildes or superscripts. Thus by definition:

$$ p_{ij} = E(\tilde p_{ij}) $$

and $\pi_i$ is the steady state probability of being in state i for the
expected process. Note that since $\tilde\pi_j$ is not in general linear in
$\tilde p_{ij}$, it is not necessary that $\pi_j = E(\tilde\pi_j)$.
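
A small worked example may make this last point concrete. It uses a
hypothetical two-state chain with a two-point prior (not the beta prior used
later), chosen only because the arithmetic is short. Suppose the probability
of going from state 2 to state 1 is known to be 1/2, while the probability
of going from state 1 to state 2 is 0.1 or 0.9 with equal probability. For a
two-state chain the steady state probability of state 1 is the 2-to-1
probability divided by the sum of the two off-diagonal probabilities, so:

```latex
\begin{align*}
E(\tilde\pi_1) &= \tfrac12\cdot\frac{0.5}{0.1+0.5} + \tfrac12\cdot\frac{0.5}{0.9+0.5}
               = \tfrac12\left(\tfrac{5}{6}+\tfrac{5}{14}\right) = \tfrac{25}{42}\approx 0.595,\\
\pi_1 &= \frac{0.5}{E(\tilde p_{12})+0.5} = \frac{0.5}{0.5+0.5} = 0.5 .
\end{align*}
```

The steady state probability of the expected process (0.5) therefore differs
from the expected steady state probability of the primary process (about
0.595).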

Since the primary process is composed of random variables, we will
often find it useful (for simulation purposes) to draw sample points from
these distributions to get processes which we will call sample processes.
Statistics associated with these processes will be denoted by a
presuperscript s. Thus ${}^{s}\pi^{A}_{i}$ is the steady state probability
of being in state i under policy A for a particular sample process s.

Thus, variables associated with the primary process have tildes,
variables associated with the expected process have no extra markings, and
variables associated with sample processes have a presuperscript s. Because
understanding of the relationship of these various processes is so important
to what follows, we conclude this section with a graphical portrait of their
relationship.


We have an actual process, a true state of nature, but unfortunately,
we do not know it exactly:

    Actual process (unknown numbers)
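
Being Bayesians, we treat the unknown transition probabilities
as random variables, $\tilde p^k_{ij}$, whose distribution is indicated by
a parameter $m^k_{ij}$. Thus:

Primary process (random variables):

$$
\begin{bmatrix}
\tilde p^1_{11} & \tilde p^1_{12} & \cdots & \tilde p^1_{1N} \\
\tilde p^2_{11} & \cdots & & \\
\vdots & & & \\
\tilde p^1_{21} & \cdots & & \\
\vdots & & & \\
\tilde p^k_{N1} & \cdots & &
\end{bmatrix}
$$

Prior parameters (known numbers):

$$
\begin{bmatrix}
m^1_{11} & m^1_{12} & \cdots & m^1_{1N} \\
m^2_{11} & \cdots & & \\
\vdots & & & \\
m^1_{21} & \cdots & & \\
\vdots & & & \\
m^k_{N1} & \cdots & &
\end{bmatrix}
$$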


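For convenience in later analysis we define still another process.

Expected process (known numbers):

$$
\begin{bmatrix}
E(\tilde p^1_{11}) & \cdots & E(\tilde p^1_{1N}) \\
E(\tilde p^2_{11}) & \cdots & \\
E(\tilde p^1_{21}) & \cdots & \\
\vdots & &
\end{bmatrix}
=
\begin{bmatrix}
p^1_{11} & \cdots & p^1_{1N} \\
p^2_{11} & \cdots & \\
p^1_{21} & \cdots & \\
\vdots & &
\end{bmatrix}
$$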

Finally, from the prior distribution we can draw:

    sample process 1 (known numbers)
    sample process 2 (known numbers)
    sample process 3 (known numbers)


CHAPTER  II 


THE  MULTI-DIMENSIONAL  BETA  DISTRIBUTION 


2.1 In search of a prior distribution

The problem we have formulated assumes that there is prior knowledge
of the unknown transition probabilities of the primary process, and that
this knowledge is to be expressed probabilistically. We shall now display
a convenient form in which to express this knowledge.


Assume that at some moment the system is in state i. The immediate
future of the system can be described as a multinomial Bernoulli process.
That is, one and only one transition to another state will take place, and
that transition will be made according to the probabilities $p_{ij}$,
j = 1, 2, ..., N. If we consider only those transitions made out of state i
we have a multinomial process:

$$ \Pr(E \mid p_{i1}, p_{i2}, \ldots, p_{iN}) = C \, p_{i1}^{n_1} p_{i2}^{n_2} \cdots p_{iN}^{n_N} $$

where E is the event of observing exactly $n_1$ transitions from state i
to state 1, $n_2$ from i to 2, etc. The conjugate prior for this
distribution is the multidimensional beta distribution:

$$ f_\beta(p_{i1}, \ldots, p_{iN} \mid m_{i1}, \ldots, m_{iN}) = C \, p_{i1}^{m_{i1}-1} \cdots p_{iN}^{m_{iN}-1} $$


We shall shortly show that the multidimensional beta is an exceptionally
convenient form to use for the distribution for $p_{i1}, p_{i2}, \ldots,
p_{iN}$. But first it must be noted that use of this prior involves the
implicit assumption that the random variables in each row of the transition
matrix are independent of the random variables in any other row of the
matrix. That this would in fact be the case in any application is not
obvious, and in some sense it is even unlikely. In the example proposed in
the introduction, a similar type of alternative existed in every state. If
those probabilities were unknown, information about alternative 1 in state 1
might well give us some clues as to the transition probabilities for the
same alternative in state 2.

So, while we shall use the multidimensional beta distribution throughout, in
order to keep the calculations simple, it should be borne in mind that this
implicit assumption of independence may not be desirable in some
applications.

Throughout this report we shall denote by $m_{ij}$ the beta parameters
used, and shall denote by $N_i$ the row sum of the $m_{ij}$:

$$ N_i = \sum_j m_{ij} $$

We shall now consider some of the properties of this distribution, and
an interpretation for the $m_{ij}$.

2.2 Properties of the multidimensional beta distribution


We simply list some of the general properties of the multidimensional beta
distribution.

1. The marginal distribution of a particular $\tilde p_{ij}$ is given by:

$$ f_\beta(p_{ij} \mid m_{ij},\ N_i - m_{ij}) $$

2. The expected value of a particular $\tilde p_{ij}$ is given by:

$$ \bar p_{ij} = E(\tilde p_{ij}) = \frac{m_{ij}}{N_i} $$

3. The variance of a particular $\tilde p_{ij}$ is given by:

$$ \operatorname{var}(\tilde p_{ij}) = \frac{m_{ij}}{N_i}\left(1 - \frac{m_{ij}}{N_i}\right)\frac{1}{N_i + 1} = \frac{\bar p_{ij}(1 - \bar p_{ij})}{N_i + 1} $$

4. Bayes modification of the distribution can be accomplished by
inspection. Let E be the event that we observe $f_{ij}$ transitions from
i to j, j = 1, 2, ..., N. Then the posterior distribution is:

$$ f_\beta(p_{i1}, p_{i2}, \ldots, p_{iN} \mid m_{i1} + f_{i1},\ m_{i2} + f_{i2},\ \ldots,\ m_{iN} + f_{iN}) $$
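
The mean, variance, and Bayes modification properties listed above amount to
a few lines of arithmetic on one row of parameters: the mean is the
parameter divided by the row sum, the variance is mean times one minus mean
divided by the row sum plus one, and the posterior adds the observed counts
to the prior parameters. The sketch below shows this; the function names and
the numbers in the example are hypothetical.

```python
from fractions import Fraction


def posterior(m, f):
    """Add observed transition counts f_ij to the prior parameters m_ij."""
    return [mj + fj for mj, fj in zip(m, f)]


def mean(m):
    """E(p_ij) = m_ij / N_i, with N_i the row sum of the parameters."""
    total = sum(m)
    return [Fraction(mj, total) for mj in m]


def variance(m):
    """var(p_ij) = mean * (1 - mean) / (N_i + 1)."""
    total = sum(m)
    return [p * (1 - p) / (total + 1) for p in mean(m)]


prior = [1, 2, 1]        # hypothetical integer parameters for one row
counts = [3, 0, 1]       # hypothetical observed transitions out of state i
post = posterior(prior, counts)
print(post, mean(post), variance(post))
```

Here the posterior parameters are (4, 2, 2), so the posterior mean of the
first probability is 1/2 and its variance is 1/36.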


2.3 Determination of prior parameters

The expression for the mean $\bar p_{ij}$ gives us an interpretation of the
prior parameters, and thus an intuitive way of assigning them. Since
$E(\tilde p_{ij}) = m_{ij}/N_i$, we can specify, instead of the $m_{ij}$,
N-1 of the mean values and a value for $N_i$. From these statistics, the
prior parameters can be calculated. $N_i$ is some kind of a measure of our
certainty, as we can see by noting that the variance is inversely
proportional to $N_i + 1$. E. A. Silver (3) notes that good results can be
obtained by a least squares fit of the intuitive marginal variances to a
single value $N_i$. But in most applications it is likely that one will not
have a good idea of the variances which are to be so fit, so $N_i$ will
remain a somewhat crude measure of "certainty."

It may help to note that specifying an initial value of $N_i$ will yield
the same result as if we had specified a uniform prior (every $m_{ij} = 1$,
so that the row sum is N), and had then taken $N_i - N$ observations which
happened to yield the same mean values. Specifying a value of $N_i$ is thus
in a sense equivalent to having taken $N_i - N$ observations starting from a
uniform prior.
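
The assignment just described can be sketched in a few lines: each parameter
is the row sum times the chosen mean value, and the number of equivalent
observations is the row sum minus the number of states. The function names
and the example numbers are assumptions made for this illustration.

```python
def prior_parameters(means, n_i):
    """m_ij = n_i * (mean of p_ij); the means must sum to one."""
    if abs(sum(means) - 1.0) > 1e-9:
        raise ValueError("mean values must sum to 1")
    return [n_i * p for p in means]


def equivalent_observations(n_i, n_states):
    """Observations beyond a uniform prior (all m_ij = 1, row sum n_states)."""
    return n_i - n_states


# Hypothetical three-state row: means (1/2, 1/4, 1/4) and certainty N_i = 8.
m = prior_parameters([0.5, 0.25, 0.25], 8)
print(m, equivalent_observations(8, 3))
```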


2.4 Methods of sampling


Simulation methods will be used extensively in the following work;
it is therefore important to have a method of sampling from the beta
prior. Ideally, the prior represents completely our knowledge about the
state of nature, so that the actual process is only a "random draw from
the prior." We take this interpretation literally, and test all our
heuristics by drawing many sample processes from the prior and evaluating
the heuristics with each one of the possible states of nature thus obtained.
The sampling procedure is an approximation to the usual Bayesian technique
of preposterior analysis which, in the examples we shall study, is an
exceedingly complex mathematical task.


E. A. Silver (3) has shown that to randomly draw from $f_\beta(p \mid m)$,
we can take independent draws ($y_{ij}$) from the N simple Gamma
distributions:

$$ f_\gamma(y_{ij} \mid m_{ij}, q) = \frac{q}{\Gamma(m_{ij})} (q\, y_{ij})^{m_{ij}-1} e^{-q\, y_{ij}}, \qquad 0 < y_{ij} < \infty $$

and then

$$ p_{ij} = \frac{y_{ij}}{\sum_{k=1}^{N} y_{ik}} $$

If all the $m_{ij}$ are integral, the Gamma becomes a simple
Erlang-$m_{ij}$. Mosiman (2) shows that to sample from the Erlang-$m_{ij}$:

$$ y_{ij} = -\frac{1}{q} \ln \prod_{l=1}^{m_{ij}} r_l = -\frac{1}{q} \sum_{l=1}^{m_{ij}} \ln r_l $$

where $r_l$ is drawn from a rectangular distribution on [0, 1]. Since we
divide by $\sum_k y_{ik}$, the scale factor q is irrelevant, and can be
chosen to keep the logarithms in a convenient range (we use q = 20). To be
certain that all the $m_{ij}$ are integral, we truncate the actual values.
This is necessary, as Mosiman also states that there is no way to sample
from a general Gamma, short of using tables of the incomplete Gamma
function.


The  subroutine  GEN  in  Appendix  A  shows  a  programmed  version 
of  this  sampling  routine. 
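
For readers without access to Appendix A, the following is a minimal sketch
of the sampling procedure described in this section. It is not the
subroutine GEN; the function names, the use of Python's `random` module, and
the example parameters are assumptions made for this illustration.

```python
import math
import random


def erlang_draw(m, q, rng):
    """Erlang-m draw: -(1/q) times the sum of m logs of uniform variates."""
    # 1 - rng.random() lies in (0, 1], which keeps log() finite.
    return -sum(math.log(1.0 - rng.random()) for _ in range(m)) / q


def sample_row(m_row, rng, q=20.0):
    """Draw one row of transition probabilities from the beta prior.

    Parameters are truncated to integers (assumed to be at least 1).
    """
    y = [erlang_draw(int(m), q, rng) for m in m_row]
    total = sum(y)
    return [yj / total for yj in y]


# Hypothetical row with parameters (4, 2, 2), so the means are (1/2, 1/4, 1/4).
rng = random.Random(0)
rows = [sample_row([4, 2, 2], rng) for _ in range(20000)]
print([sum(r[j] for r in rows) / len(rows) for j in range(3)])
```

The sample averages printed at the end should lie close to the prior means,
which gives a quick check that the routine is drawing from the intended
distribution.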

CHAPTER III


A  SPECIAL  TWO  STATE  PROBLEM 

3.1 The Problem

We now have a notation for expressing our decision problem, and
a prior distribution to encode our knowledge about the unknown transition
probabilities. Let us now begin to formulate the general decision problem
described in the introduction.

We initiate this task by considering a very simple special case,
preparatory to the formulation of the most general case in Chapter IV.

Consider the following two-state discrete time Markov process.

When in state 1 there is no choice; the transition probabilities are
$(p_{11}, p_{12})$ and the transition rewards $(r_{11}, r_{12})$.
There are two alternatives when in state 2: alternative 1 having known
transition probabilities $(p^1_{21}, p^1_{22})$ with rewards
$(r^1_{21}, r^1_{22})$, and alternative 2 having the unknown transition
probabilities $(\tilde p^2_{21}, \tilde p^2_{22})$ with known rewards
$(r^2_{21}, r^2_{22})$. This process can be represented by the following
diagram:

    [Diagram of the two-state process]

We assume that the random variables $\tilde p^2_{21}$ and
$\tilde p^2_{22}$ are multidimensional beta distributed. For simplicity, we
will denote $\tilde p^2_{21}$ simply by $\tilde p$ for the remainder of this
chapter, as there are no other unknown probabilities
($\tilde p^2_{22} = 1 - \tilde p$). We denote the beta parameter
$m^2_{21}$ simply by r, and the parameter $N^2_2$ simply by n
($m^2_{22} = N^2_2 - m^2_{21}$).


The  decision  problem  now  becomes  to  specify  for  each  set  of 
parameters  (  r,  n)  whether  to  follow  alternative  1  or  alternative  2  in 
state  2. 

3.2 Formulation by Dynamic Programming
3.2.1 The Dynamic Programming Approach

The  solution  of  a  problem  of  this  type  depends  upon  all  possible 
outcomes  and  all  decisions  in  the  future.  The  outcome  and  decision  tree, 
however,  is  infinite  since  the  process  will  operate  for  an  indefinitely  long 
time.  Discounting  reduces  the  importance  of  the  future,  but  we  are 
essentially  dealing  with  a  boundary  value  problem  with  the  boundary  of 
infinity.  The  only  technique  presently  known  for  solving  such  a  problem 
is  that  of  truncation  to  a  finite  outcome  decision  space.  This  finite 
problem  can  be  solved  by  dynamic  programming,  and  the  size  of  the 
finite  case  taken  large  enough  to  yield  a  converging  approximation  to 
the infinite case. Usually this must be done numerically, though it may
be possible to discover certain analytic properties of the optimum
solution which enable us to find the optimum by another, possibly analytic,
method.

3.2.2 Terminal Policy

In  order  to  formulate  the  problem  we  will  first  introduce  a 
terminal  situation.  Suppose  we  are  given  a  process  which  will  operate 
for  an  indefinitely  long  period  of  time.  While  the  process  operates  we 
must  make  a  decision  every  time  that  the  process  enters  any  state  with 
more  than  one  alternative.  We  will  observe  the  outcomes  and  improve 
our  decisions  in  time.  However,  suppose  that  at  some  preknown  time 
we will make a terminal policy decision which will be used throughout
the remainder of the operation. This terminal decision policy will
specify not only an alternative for the state currently occupied, but
also alternative decisions for all states which might be entered in the
future. The alternative to be chosen at each state is always fixed (i.e.
the terminal policy is time-invariant). Each possible terminal policy
decision has a terminal value associated with it.

3.2.3 The Decision Stage

The  time  from  the  start  of  the  process  until  the  terminal  decision 
will  consist  of  a  fixed  number  of  decision  stages.  One  possible  choice 
for  the  decision  stage  would  be  the  period  (holding  time)  of  the  process. 

But, as decisions are only made when in state 2 for this simple process,
it will be convenient to define the decision stage to be the time between
consecutive departures from state 2. This time is random in terms of
the number of transitions, but this disadvantage is more than offset by
the fact that with this choice we know precisely how many observations
of the transition whose probability is unknown are taken per decision
stage: precisely one. Any time the process enters state 1 it will be
regarded as having the decision stage number of the next entry into state
2.

3.2.4  Dynamic  Programming  Equations 

At the 0th stage we will choose the policy with the highest expected
terminal value. Let $V_2^N(r, n)$ denote the expected present discounted
value of the infinitely long income stream from the process given that an
optimal policy is followed, the system is in state 2 and N more decisions
are to be made before the terminal decision. This quantity is obviously
a function of the prior parameters of the unknown p. We use a capital V
for this value function which has prior parameters as arguments to
distinguish it from the function lower case v which is the value function
when the probabilities are known. We can always compute v from the equations
(see Chapter I) $v = [I - \beta P]^{-1} q$. Capital V is another matter. At
the 0th stage, we can then write:

$$ V_2^0(r, n) = \max\left\{ v_2(P^1),\ \int_0^1 v_2(P^2)\, f_\beta(p \mid r, n)\, dp \right\} $$

where $P^k$ denotes the square transition probability matrix associated
with policy k (i.e. follow alternative k when in state 2). The first term
inside the braces is the value, when in state 2, of following policy 1. The
second term is the expected value, when in state 2, of following policy 2.

We can next write expressions for $V_2^N(r, n)$ in terms of similar
functions for (N - 1). Given that we are in state 2 with N stages left,
with probability p we will be in state 1 with (N - 1) stages left, and with
probability 1 - p we will be in state 2 with (N - 1) stages to go. Taking
the expectation over p (recall that $\bar p = r/n$), we find:

$$ V_2^N(r, n) = \max\left\{ v_2(P^1),\ \frac{r}{n}\left[ r^2_{21} + \beta V_1^{N-1}(r+1, n+1) \right] + \left(1 - \frac{r}{n}\right)\left[ r^2_{22} + \beta V_2^{N-1}(r, n+1) \right] \right\} $$


Notice that the first expression inside the braces, which corresponds
to taking alternative 1, is a very simple expression. This is because
once we decide to follow the alternative with known transition
probabilities, we get no new information about the unknown probabilities,
and therefore there will never be any reason to change our minds about
following the unknown alternative; that is, once we start to follow
alternative 1 we follow it forever. Thus, the appropriate expression to
enter for the choice of alternative 1 in our expression is simply the
present value of the income we would get by following that alternative
forever. This is denoted by $v_2(P^1)$, and is computed from the equations
of Chapter I.

We can also write an expression for $V_1^N(r, n)$ in terms of
$V_2^N(r, n)$ so that our expression involves only one unknown. Since state
1 has known probabilities,

$$ V_1^N(r, n) = \frac{p_{11} r_{11} + p_{12}\left[ r_{12} + \beta V_2^N(r, n) \right]}{1 - \beta p_{11}} $$

Moreover, in the equation for $V_2^N(r, n)$, the superscripts are
superfluous, since our definition of the decision stage forces N + n to be a
constant. Thus, the fact that the right side is superscripted (N - 1) is
already conveyed by the fact that the argument is (n + 1). So, we can drop
the superscripts and write our dynamic programming equations as:

$$ V_2(r, n) = \max\left\{ v_2(P^1),\ \frac{r}{n}\left[ a_1 + a_2 V_2(r+1, n+1) \right] + \left(1 - \frac{r}{n}\right)\left[ a_3 + a_4 V_2(r, n+1) \right] \right\} $$

where again $v_2(P^1)$ is the value of following the known alternative 1,
and:

$$ a_1 = r^2_{21} + \frac{\beta}{1 - \beta p_{11}}\left( p_{11} r_{11} + p_{12} r_{12} \right) $$

$$ a_2 = \frac{\beta^2 p_{12}}{1 - \beta p_{11}} $$

$$ a_3 = r^2_{22} $$

$$ a_4 = \beta $$
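
The four constants can be checked by a short substitution, sketched here
under the assumption that $\beta$ denotes the discount factor and that state
1 has known probabilities $(p_{11}, p_{12})$ and rewards $(r_{11}, r_{12})$.
Writing the value in state 1 in terms of the value in state 2 and inserting
it into the two bracketed terms of the recursion gives:

```latex
\begin{align*}
V_1 &= p_{11}\,(r_{11} + \beta V_1) + p_{12}\,(r_{12} + \beta V_2)
\;\Longrightarrow\;
V_1 = \frac{p_{11} r_{11} + p_{12} r_{12} + \beta p_{12} V_2}{1 - \beta p_{11}},\\[4pt]
r^2_{21} + \beta V_1(r+1, n+1)
 &= \underbrace{r^2_{21} + \frac{\beta\,(p_{11} r_{11} + p_{12} r_{12})}{1 - \beta p_{11}}}_{a_1}
  + \underbrace{\frac{\beta^2 p_{12}}{1 - \beta p_{11}}}_{a_2}\, V_2(r+1, n+1),\\[4pt]
r^2_{22} + \beta V_2(r, n+1)
 &= \underbrace{r^2_{22}}_{a_3} + \underbrace{\beta}_{a_4}\, V_2(r, n+1).
\end{align*}
```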

3.3 Knowledge Space

The function $V_2(r, n)$ is defined over a two dimensional space, and
may be plotted in three dimensions. However, what is critical for decision
purposes is the plot of the points where the two expressions in the maximum
above are equal. On one side of this boundary, we will always choose
alternative one, and on the other side we will always choose alternative
two. Thus, given this boundary, and knowing which side of it corresponds to
alternative one, our decision problem is solved. Given any state of
knowledge (r, n) we merely look at our graph and see which alternative to
follow. In the next sections, we will give a method for computing this
boundary, but let us first pause to investigate which side of the boundary
corresponds to which policy.

Under certainty, the value as a function of the transition probability
in state 2, p, can be expressed as:

$$ v_2(p) = \frac{a_1 p + a_3 (1 - p)}{1 - a_2 p - a_4 (1 - p)} $$

where the $a_i$ are as defined in the last section. Then:

$$ \frac{d\, v_2(p)}{dp} = \frac{(a_1 - a_3)(1 - a_4) - a_3 (a_4 - a_2)}{\left[ 1 - a_2 p - a_4 (1 - p) \right]^2} $$

The  denominator  of  this  expression  is  always  positive,  so  the 
derivative  takes  the  sign  of  the  numerator,  which  is  a  constant.  That  is, 
the  value  function  monotonically  increases  or  decreases  as  a  function  of 
p.  Thus,  the  policy  with  known  transition  probabilities  may  be  above  or 
below  the  boundary,  but  this  may  be  determined  by  inspecting  the  sign  of 
the constant numerator above for any problem. If the derivative is positive
the "known" policy would be followed when we had states of knowledge lying
above the decision boundary.

3.4 Numerical Computation Using Dynamic Programming

Suppose that for some N and for all r we know the values of $V_2(r, N)$.
Inspection of the dynamic programming equations then reveals that we can
compute the value of $V_2(r, N-1)$ from these for all r, and so on down to
$V_2(r, 0)$. This can be done by simple calculation since the entire right
hand side of the equation is constant except for values of $V_2(r, N)$
which we assume we know. Thus, the entire problem reduces to finding the
values along some value n. If we were approximating the infinite case by a
finite duration case, we could just assign termination values at some large
value of n and iterate back, but finding the termination values discussed in
Section 3.2.4 is itself a very difficult problem. A much simpler
approximation is to say that for some large value of n, say n = 1000, we
(virtually) know the value of p with certainty as being r/n. Thus,

$$ V_2(r, N) = \max\left\{ v_2(P^1),\ v_2(P^2) \right\} $$

where $P^2$ is the known matrix with p = r/N.

We have used this computational device to evaluate the decision
boundaries. The use of discounting assures us that if we take N
large, the terminal values assigned there will not make very much
difference in any case, and experience has shown that values of N
around 50 yield results almost identical to values around 1000.
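
The device of assigning certainty values at a horizon N and iterating back
can be illustrated on a deliberately simplified stand-in problem. The model
below is an assumption made for the sketch, not the report's two-state
equations: a "known" alternative pays `known_reward` per transition forever,
while the uncertain alternative pays 1 with unknown probability p and 0
otherwise, with (r, n) recording r successes in n observations.

```python
def backward_induction(N, beta, known_reward):
    """Tabulate V[n][r] for 1 <= n <= N and 0 <= r <= n by iterating back.

    At the horizon n = N the probability is treated as known, p = r/N,
    so the terminal value is that of the better alternative followed forever.
    """
    stop_value = known_reward / (1.0 - beta)
    V = {N: [max(known_reward, r / N) / (1.0 - beta) for r in range(N + 1)]}
    explore = {}
    for n in range(N - 1, 0, -1):
        row, choice = [], []
        for r in range(n + 1):
            p_hat = r / n  # expected value of p given (r, n)
            cont = (p_hat * (1.0 + beta * V[n + 1][r + 1])
                    + (1.0 - p_hat) * beta * V[n + 1][r])
            row.append(max(stop_value, cont))
            choice.append(cont > stop_value)  # True: try the uncertain arm
        V[n], explore[n] = row, choice
    return V, explore


V50, explore50 = backward_induction(50, 0.9, 0.5)
V200, explore200 = backward_induction(200, 0.9, 0.5)
```

With a discount factor of 0.9 the tables for small n barely change when the
horizon moves from 50 to 200, which is the behaviour the text reports for
its own problem.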

3.5  Computation Results

A computer program was written to find the solution to this
two state problem. Its specific objectives were to illustrate the shape
of the decision boundary, the speed of convergence with N (the assumed
"infinity"), the sensitivity of the decision boundary to the discount factor,
and the shape of the isovalue curves in the (r, n) plane.

Several problems were run with different values of the discount factor β,
of the other problem parameters, and of N to illustrate these various
properties.

For the first 6 problems the decision boundary points are
plotted in the (r, n) space (see graphs following). These are the points
where it is first best to pick alternative 1 when in state 2. For
instance, in the second problem (see graph 2), if we had n = 10,
r = 8, we would decide to follow policy 2 in the next transition. However,
if n = 10, r = 9, we would follow policy 1 in the next transition
(and hence, forever, since r and n will not change). The problems
and parameters are:


1)  β = .9,     N = 50
2)  β = .9999,  N = 50
3)  β = .9,     N = 50
4)  β = .9999,  N = 50
5)  β = .9,     N = 50
6)  β = .9,     N = 1000

(Two further parameters were listed for each problem; they are illegible
in this copy.)

The purpose of the first 6 cases is to illustrate the shape of
the decision boundary and how it changes with the value of the discount
factor β. Notice that for small values of β the decision points stay
very close to the boundary under certainty (straight line below the points).
Only for larger β, where the future has great importance, does the
boundary move away from the certainty case. In the limit (as β approaches
one) the boundary approaches the 45° line: experiment forever.

Comparison of Figure 1 and Figure 6 illustrates the speed of
convergence with N for β = .9. Figure 6 also shows the asymptotic
behavior of the decision boundary. Notice that the decision points remain
the same for N = 1000 and N = 50 except for those near 50. For N = 1000
the decision boundary remains parallel to the certainty boundary for as
far out as we have investigated.

The next series of graphs is concerned with a continuous
decision boundary and isovalue curves obtained by linear interpolation.
The significance of interpolated values is that if the value
function $V_2(r, n)$ is defined for all r and n, and is relatively smooth
and flat, then the values can be approximated by interpolation. Graph
#7 shows $V_2(r, n)$ plotted against r for several values of n. Apparently
the value function is indeed smooth and flat enough to justify interpolation
between the values on the lattice. However, since at the assumed
horizon n is an integer (= 50) and r is also taken to be an integer for
decision purposes, the results that follow do not have much operational
meaning, and are helpful only to get a rough idea of the behavior of the
value $V_2(r, n)$.

The next graph, #8, shows the decision boundary found by interpolation
for three values of β. The significant feature of these curves
is their "lumpy" nature. The end points of the lumps occur at integers,
where decisions could be made.

The next two graphs show isovalue curves. Isovalue curves
for values of $V_2(r, n)$ close to the n-axis appear to be straight lines
and have a slope which is very close to the slope of the value function
under certainty. However, they do not pass through the origin when
extrapolated. Rather, they have a small positive intercept. For isovalue
curves with values close to the certainty boundary, though, the lumpy
appearance shows up again. We have not been able to provide a satisfactory
explanation for this phenomenon.
3.6  Conclusion

In this chapter we have considered the exact solution to a
simple two state case. We have shown the dynamic programming
method of solution, and the numerical solutions for several values
of the problem parameters. The conclusions drawn from this work
are:

1)  When β = .9 or smaller, the value under certainty gives
a good approximation to the value under uncertainty for n
greater than 15 or 20. The isovalue lines are displaced
upward slightly from the isovalue lines of the certainty
case, but their slope is undisturbed.

2)  As β is increased from .9, the effects of uncertainty are
more pronounced. The decision boundary bulges upward
and becomes lumpy. Isovalue lines near the boundary are
similarly affected, though those far from the certainty
boundary remain remarkably straight lines.

Now  that  we  have  considered  the  nature  of  an  exact  solution 
for  a  simple  case,  we  will  discuss  in  the  next  section  the  difficulties 
encountered  in  trying  to  apply  this  method  to  even  a  slightly  larger 
problem. 
CHAPTER IV


INFEASIBILITY OF A DYNAMIC PROGRAMMING
SOLUTION FOR AN N-STATE PROCESS

4.1  Terminal Analysis

We  demonstrate  the  infeasibility  of  dynamic  programming  by 
formulating  the  equations  for  the  general  case,  and  showing  that  their 
solution  requires  a  prohibitive  amount  of  calculation. 

As in the special two state example presented in the last chapter,
we approximate the infinite case by a long finite process with terminal
rewards. We calculate the terminal rewards by assuming that we must
choose a policy at the 0th stage which we will follow forever.

Exactly as in the two state case we define:

$$V_i^L(m_{11}^1, m_{12}^1, \ldots, m_{ij}^k, \ldots, m_{NN}^{k_N})$$

as the present discounted value of the infinite reward stream under the
optimal strategy given that we are now in state i with L decisions left
before the terminal decision. The decision stage is now defined as just
the period (holding time) of the process, since a decision must in general
be made in each state.

This value is obviously a function of all our present knowledge, so
all prior parameters appear. For simplicity, we denote the matrix of
prior parameters by M, and write $V_i^L(M)$.

Finally, at the terminal stage we assign a policy, and the present
discounted value of a policy given the transition probabilities can be computed
by the formulas of Chapter I. So:

$$V_i^0(M) = \max_A E[\, v_i^A(P^A) \mid M \,]$$

where, as usual, $v_i^A(P^A)$ is the present discounted value of being
in state i under policy A when P is known.


4.2  The Recursive Equations

If we are now in state i and following alternative k with L
stages left before the terminal decisions, we can go to any state j
with probability $\tilde{p}_{ij}^k$. If we go to j, we earn $r_{ij}^k$
immediately, but as of next period we can expect $V_j^{L-1}(M')$, where M'
is the matrix of posterior parameters given the transition from i to j.
Let $I_{ij}^k$ denote a matrix of zeros except for a single one in the
position corresponding to $m_{ij}^k$. Then:

$$V_i^L(M) = \max_k \; E\left[ \sum_{j=1}^{N} \tilde{p}_{ij}^k \left[ r_{ij}^k + \beta V_j^{L-1}(M + I_{ij}^k) \right] \right]$$

This expectation is seen to be linear in $\tilde{p}_{ij}^k$, so we can just
use the statistics of the expected process. Thus:

$$V_i^L(M) = \max_k \sum_{j=1}^{N} \bar{p}_{ij}^k \left[ r_{ij}^k + \beta V_j^{L-1}(M + I_{ij}^k) \right]$$

These equations closely resemble the ordinary equations for determining
an optimal policy with known transition probabilities, except
that now $V(\cdot)$ has different arguments on the two sides of the
equations.
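
The recursion can be written down directly for a toy problem, which also
makes the growth in work visible. The sketch below assumes the expected
probabilities are the prior parameters normalised by their row sum, and it
takes the terminal value as a caller-supplied function; the function names,
the tiny example and the zero terminal value are all illustrative
assumptions.

```python
from functools import lru_cache


def recursive_value(rewards, beta, terminal):
    """Return V(i, M, L) for the expected-process recursion.

    rewards[i][k][j] is the reward for going from i to j under alternative k.
    M[i][k][j] holds the prior parameters as nested tuples (hashable).
    """
    def bump(M, i, k, j):
        # Posterior parameters M + I_ij^k after observing i -> j under k.
        rows = [[list(row) for row in alts] for alts in M]
        rows[i][k][j] += 1
        return tuple(tuple(tuple(row) for row in alts) for alts in rows)

    @lru_cache(maxsize=None)
    def V(i, M, L):
        if L == 0:
            return terminal(i, M)
        best = float("-inf")
        for k, m_row in enumerate(M[i]):
            total = sum(m_row)
            value = sum(
                (m / total) * (rewards[i][k][j] + beta * V(j, bump(M, i, k, j), L - 1))
                for j, m in enumerate(m_row)
            )
            best = max(best, value)
        return best

    return V


# Toy data: state 0 has two alternatives, state 1 has one.
M0 = (((1, 1), (3, 1)), ((1, 1),))
R0 = [[[4.0, 0.0], [1.0, 1.0]], [[0.0, 2.0]]]
V = recursive_value(R0, 0.9, lambda i, M: 0.0)
two_stage_value = V(0, M0, 2)
```

Every transition produces a different posterior matrix, so the number of
distinct arguments of V multiplies at each stage even in this two-state toy.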


4.3  Conclusion

There are difficulties inherent in solving both the terminal
and the recursive dynamic programming equations. The form:

$$V_i^0(M) = \max_A E[\, v_i^A(P^A) \mid M \,]$$

is itself virtually intractable, as it involves a very complex multiple
integration. But, as will be discussed in the next chapter, it may be
appropriate to say that:

$$E[\, v_i^A(P^A) \mid M \,] = v_i^A(\bar{P}^A)$$

where $\bar{P}^A$ is the matrix of policy A in the expected process. In this
case, the terminal equations could in fact be solved by ordinary policy
iteration.

Given $V_i^0(M)$ for all values of M, there is no theoretical
difficulty in solving the rest of the problem. The only trouble is one
of dimensionality. M is a $(\sum_{j=1}^{N} k_j \times N)$ matrix, and the
state index i can run from 1 to N. If we set even so modest a task as to
tabulate for each integral $m_{ij}^k$ in the interval (1, 100), we find that
each decision stage involves:

$$N \cdot (100)^{N \sum_j k_j}$$

tabulations. Even for very small problems this is impossible. For
example, our taxicab example would require $3 \times 100^{24}$ tabulations
for each decision stage. To reach convergence would require at least 100
stages, so $3 \times 10^{50}$ calculations would be required.
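
The count can be checked with exact integer arithmetic. The sketch assumes
the taxicab example has 3, 2 and 3 alternatives in its three states, which
is consistent with the 24 prior parameters implied by the text but is not
stated in this section.

```python
def tabulations_per_stage(alternatives, grid=100):
    """N * grid ** (N * sum of k_j): one table per state over all of M."""
    n_states = len(alternatives)
    n_parameters = n_states * sum(alternatives)  # entries of the matrix M
    return n_states * grid ** n_parameters


taxicab_alternatives = (3, 2, 3)  # assumed alternative counts per state
per_stage = tabulations_per_stage(taxicab_alternatives)
total_for_100_stages = per_stage * 100
```

Python integers are unbounded, so the result is exact rather than a
floating point estimate.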

But even though a dynamic programming solution is infeasible,
it does provide us with a compact formulation of the problem, and a
clear notion of what a solution is. If we let L grow very large so that
L becomes indistinguishable from (L - 1), then the problem can be
stated.

For every possible prior matrix M and every state i, specify
that alternative $k^*$ such that:

$$V_i(M) = \max_k \sum_{j=1}^{N} \bar{p}_{ij}^k \left[ r_{ij}^k + \beta V_j(M + I_{ij}^k) \right]$$
CHAPTER V


DISTRIBUTION OF LONG RUN AVERAGE GAIN


5.1  Steady state probabilities


One measure of effectiveness for a Markov process without discounting
is the long-run average gain, which is defined as the sum of the
products of the immediate rewards by the corresponding steady state
probabilities:

$$g = \sum_i \pi_i q_i$$

Both the q's and the π's depend on the transition probabilities, so that in
the case we are studying, they are both random variables.
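
For a process with known probabilities the gain is easy to compute. The
sketch below uses repeated multiplication to approximate the steady state
probabilities, which is one simple choice among several and assumes a
completely ergodic chain; the two-state numbers are made up.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100000):
    """Approximate the steady state probabilities by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def immediate_rewards(P, R):
    """q_i = sum over j of p_ij * r_ij."""
    return [sum(p * r for p, r in zip(P[i], R[i])) for i in range(len(P))]


def gain(P, R):
    """Long-run average gain g = sum over i of pi_i * q_i."""
    pi = stationary_distribution(P)
    q = immediate_rewards(P, R)
    return sum(x * y for x, y in zip(pi, q))


# Made-up two-state process: steady state (1/3, 2/3), rewards q = (3, 6).
P_example = [[0.5, 0.5], [0.25, 0.75]]
R_example = [[3.0, 3.0], [6.0, 6.0]]
g_example = gain(P_example, R_example)
```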

E. A. Silver has shown that the $E(\tilde{\pi}_j)$'s are approximated by
the $\pi_j$'s of the expected process when the transition probabilities are
multidimensional beta distributed. (3) The accuracy of this approximation
was shown by using simulation techniques. His results also indicated that
the approximation became better as the $N_i$'s increased.

This result sheds partial light upon our problem, but does not go
quite far enough. Our primary interest is in the product of the π's with
the q's, and it is the behavior of this statistic that we shall study in the
following sections.

5.2  Distribution of gain for a given policy

In order to know how bad it might be not to follow the optimal
policy, it is useful to have some idea of:

a)  What the distribution of gain is like for a given policy, and

b)  How that distribution differs for various policies.

The second question could be answered by the first if we found the
distribution of gain explicitly in terms of the prior parameters of the
policy.

It is necessary to point out that throughout this chapter we will be
dealing with terminal analysis; that is, the same policy will be followed
forever. This is a first step towards the analysis of the general problem
which allows the policy to be changed at any time.

5.2.1  Simulation runs


In order to study the distribution of the gain, a sampling program was
written. This program receives as data the prior parameters of a Markov
process with one alternative in each state (thus there is only one policy, a
necessary restriction here since gain is associated with a particular policy).
The program produces sample processes from the prior parameters and
computes the gain of each sample process. The range of the gain is divided
into several intervals and for each interval a count is kept of the sample gains
in that range. This provides a histogram of the distribution. In addition, a
sample mean and variance are computed. (See Appendix A.14)
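
A sampling program of this kind can be sketched as follows. This is not the
program of Appendix A: it assumes each row of the transition matrix is drawn
from a Dirichlet (multidimensional beta) distribution whose parameters are
the expected probabilities scaled by a common N, and the three-state process
below is invented for the example.

```python
import random


def sample_row(parameters, rng):
    """Draw one probability row from a Dirichlet prior via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in parameters]
    total = sum(draws)
    return [d / total for d in draws]


def gain(P, R, iterations=300):
    """Gain of a known process: steady state (power iteration) times q."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    q = [sum(p * r for p, r in zip(P[i], R[i])) for i in range(n)]
    return sum(x * y for x, y in zip(pi, q))


def simulate_gain(mean_P, R, N, samples, seed=0):
    """Sample mean and variance of the gain over sampled processes."""
    rng = random.Random(seed)
    gains = []
    for _ in range(samples):
        P = [sample_row([N * p for p in row], rng) for row in mean_P]
        gains.append(gain(P, R))
    mean = sum(gains) / samples
    variance = sum((g - mean) ** 2 for g in gains) / (samples - 1)
    return mean, variance


# Invented three-state expected process and rewards.
mean_P = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
R = [[10.0, 4.0, 8.0], [14.0, 0.0, 18.0], [10.0, 2.0, 8.0]]
mean_25, var_25 = simulate_gain(mean_P, R, 25, 300)
mean_150, var_150 = simulate_gain(mean_P, R, 150, 300)
```

Binning the list of sampled gains into intervals would give the histogram
described in the text.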

We have studied five particular processes. The first is a five-state
process with probabilities and rewards drawn from a random number table.
The other four were selected policies of the three-state taxicab problem.

Each time a process was studied, the prior parameters $N_i$ were the
same for all rows (states). For different computer runs, different values
for the $N_i$ were selected. Thus, the behavior of each process was studied
for four different values of the $N_i$: 25, 50, 100, and 150. There were
also three miscellaneous runs. One was the five-state problem but the
rewards (which had a range of zero to ten) were replaced by their ten's
complement. The second miscellaneous process was the regular five-state
problem, but with different values of $N_i$ for the various states
(the average value was 100). The last run was policy (2,2,2) of the
taxicab problem, but only 25 sample processes were drawn to see how
accurate statistics could be obtained by very little sampling. There were
23 runs in all.

5.2.2  Results of the simulation

The sampling experiments described above yielded only limited
evidence of the system behavior. The experiments were in most respects
exploratory rather than aimed at testing any one particular hypothesis.

Even so, there are certain results that seem significant enough to be
mentioned here. The following conclusions are substantiated by the
accompanying graphs and tables.

E. A. Silver has shown that the general form for the $\pi_j$'s involves
$N^{N-1}$ cross product terms in the numerator (for an N state process), and
$N^N$ cross product terms in the denominator. We originally hoped that for
large processes, the law of large numbers might come into play, and that
the distribution of the gain would be approximately normal. This was one
of the reasons we tabulated a sample mean and variance when plotting our
histogram.

a.)  When the distribution of the gain is plotted on probability paper
it does approximate the normal for larger processes. Also, the larger the
value of $N_i$, the more normal the distribution appears. The mean and
variance of the normal are very close to the sample mean and variance
computed. But, however encouraging this result, it should be mentioned that a
chi-square analysis shows quite decidedly that the distribution is not normal.
In other words, the normal is clearly only an approximation. Since the
$N^N$ terms mentioned above are not independent, the law of large numbers
need not apply, even for very large processes.

b.)  For a given process, the product of the sample variance and
the common value of the $N_i$'s appears to be more or less constant. This
suggests that given the constant (which is seen to vary from process to
process), we could predict the variance merely by knowing the prior
parameters $N_i$ when they are the same in every state. Our miscellaneous run
indicates that the statistics do not change much if the $N_i$ are slightly
different; we can just use their mean value.

c.)  The mean value of the gain is approximated well by the gain
of the expected process, and this approximation becomes better and better
as the prior parameters $N_i$ increase. This is an extension of E. A. Silver's
result that the above is true for the $\pi_j$ alone.

d.)  Sometimes the gain of the expected process is larger than the
sample mean, and other times it is smaller. The two cases are illustrated
by the case where we took the 10's complement of all rewards. We have
determined no a priori way of determining which case will pertain in a
particular process.
RESULTS OF SIMULATION OF TAXICAB EXAMPLE


500 sample processes for each simulation


Policy     N    g (exp. proc.)   E(g̃)     Var(g̃)    N·Var(g̃)

2,2,2      25       13.34        14.04    1.3047    32.5
           50                    13.42     .6399    32.0
          100                    13.39     .3401    34.0
          150                    13.44     .2458    36.8

1,2,2      25       13.15        14.04    1.1375    28.4
           50                    13.31     .6784    33.9
          100                    13.23     .3098    31.0
          150                    13.32     .2117    30.4

2,1,2      25        8.81         8.67     .1107     2.27
           50                     8.72     .0504     2.52
          100                     8.79     .0228     2.28
          150                     8.79     .0201     3.01  (only 62 samples)

2,2,1      25       12.89        13.93    1.2716    31.8
           50                    12.96     .7494    37.5
          100                    12.93     .4604    46.0
          150                    12.92     .2688    40.3

2,2,2      25       13.34        13.72    1.5228            (25 samples)
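
The "more or less constant" product of N and the sample variance can be
recomputed from the tabulated figures. The sketch below uses only the four
500-sample rows for policy (2,2,2) as printed in the taxicab table; the
helper name is an invention for the example.

```python
def n_times_variance(rows):
    """Return N * Var for each (N, Var) pair."""
    return [n * var for n, var in rows]


# (N, sample variance) for policy 2,2,2, copied from the taxicab table.
policy_222 = [(25, 1.3047), (50, 0.6399), (100, 0.3401), (150, 0.2458)]
products = n_times_variance(policy_222)
spread = max(products) / min(products)
```

The recomputed products agree with the printed column to within rounding,
and the largest exceeds the smallest by roughly fifteen percent.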
RESULTS OF SIMULATION ON FIVE-STATE RANDOM PROCESS


Number of
Samples     N                      g (exp. proc.)   E(g̃)    Var(g̃)   N·Var(g̃)

500         25                         4.69         4.60    .0683    1.7075
455         50                                      4.67    .0365    1.825
220         100                                     4.68    .0167    1.670
500         150                                     4.69    .0096    1.440
500         different (ave = 100)                   4.68    .0155

"Reversed rewards" case

500         150                        5.38         5.39    .0091

[Graph: cumulative probability (in %) of the gain, on probability paper]


5.3  Limitations and suggestions for future work

In concluding this chapter we would like to point out some of the
limitations of our research into the distribution of the long run average
gain.

In the first place, we have studied a very limited number of
processes (only five), and they have been of small size (three-state and
five-state). It would be quite desirable to study more problems of larger
size to test the generality of our results.

Second, in all the processes studied (except for one) all the $N_i$'s
had the same value, and only four different values were considered. A
few processes with larger $N_i$'s should be studied, as well as more
processes where the $N_i$'s for the same process have different values in
different states.

Third,  we  did  not  study  the  effect  of  different  reward  structures 
on  the  same  transition  probability  structure.  Whether  the  two  can  be 
separated  is  a  question  which  should  be  studied. 

Fourth, all the processes studied in this section were processes
with only one alternative in each state. It would be interesting to
investigate the general case with alternatives. There would be two ways to
approach this. The first would be to take some selected policies and study
the distribution of the gain for each one separately. (To some extent, we
did this with the taxicab problem.) Another way would be to take the prior
parameters for the entire multi-alternative process and to sample complete
multi-alternative processes from this. Each sample process could then be
solved for the optimal policy, and sample optimal gains could be recorded
(and perhaps also compared against the sample gains of some fixed policies).
A good number of sample processes would be needed for this program: many
times the number of policies (which is itself a very large number).

Fifth, the processes studied have always been completely ergodic.
It should be interesting to study the behavior of processes with transient
states, and/or multiple chains.

Sixth, we mentioned before that we could predict the variance for
larger values of $N_i$, given the "constant" product of some $N_i$ and its
variance. We should note, however, that in actual processes when $N_i$ is
changed due to observations, the mean values change also, so that in
practice different values of $N_i$ are associated with different expected
processes. We have not shown that the "constant" remains constant
under such operations, but just what it does do would be an interesting
question for study.

5.4  Conclusions

With all the limitations pointed out in the last section, it would
be very presumptuous to state any general normative rules. However,
it seems safe to say the following:

a.)  When the prior parameters are large enough (and 150
seems large enough for five state processes), it is a good approximation
to assume that the process is known with certainty and that the probabilities
are given by the expected process. The value obtained from the expected
process is almost identical to the sample mean obtained by simulation, and
the sample variance is almost negligible.

b.)  Even when the $N_i$ are quite small (say 25 in the five state case),
the gain of the expected process is close (say 1/2 standard deviation) to the
sample mean, and the standard deviation is only about 10% of the mean
value. Moreover, this figure falls as $N^{-1/2}$.
CHAPTER VI


HEURISTIC METHODS

6.1  General considerations

We have previously discussed the analytic difficulties which lead
us to a heuristic approach. In this chapter we consider some approaches
to the problem which are empirical or intuitive in nature, present some
experimental results, and suggest possible extensions and generalizations.

The basis for most of the heuristics is that some sort of trade-off
is implicit in our problems. On the one hand is the immediate expected
reward to be earned from the process if we follow the optimal policy of
the expected process, but on the other hand is the possibility of finding
a still better policy by further experimentation with relatively unexplored
alternatives. It will be noted that, with the exception of the first and
simplest heuristic, all of the suggested approaches have one or more
parameters which attempt to measure this trade-off.
6.2  Follow optimal policy for expected process


This is, conceptually, the simplest possible heuristic. After
every transition we perform a Bayesian modification of our prior. Then,
using the matrix of expected values, we determine the optimal policy to
follow for the next transition.

Obviously, a single transition is unlikely to greatly affect the
policy decision, and it would be expensive to re-examine the policy every
period. So for purposes of experimentation, we chose 50 transitions as
a convenient and reasonable period for re-examination.
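
The loop just described can be sketched for a small problem. Several things
here are assumptions made for illustration: the best policy of the expected
process is found by brute-force enumeration of policies and comparison of
their gains (standing in for policy iteration), the prior is stored as counts
M[i][k][j], and the toy process is invented.

```python
import itertools
import random


def stationary(P, iterations=2000):
    """Steady state probabilities of a known chain by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def best_policy(M, R):
    """Policy with the largest gain in the expected process (enumeration)."""
    n = len(M)
    best, best_gain = None, float("-inf")
    for policy in itertools.product(*(range(len(M[i])) for i in range(n))):
        P = [[m / sum(M[i][k]) for m in M[i][k]] for i, k in enumerate(policy)]
        q = [sum(p * r for p, r in zip(P[i], R[i][k])) for i, k in enumerate(policy)]
        g = sum(x * y for x, y in zip(stationary(P), q))
        if g > best_gain:
            best, best_gain = policy, g
    return best, best_gain


def run_heuristic(M, R, true_P, periods, block=50, seed=0):
    """Re-examine the policy every `block` transitions; M is updated in place."""
    rng = random.Random(seed)
    state, history = 0, []
    for _ in range(periods):
        policy, _ = best_policy(M, R)
        history.append(policy)
        for _ in range(block):
            k = policy[state]
            nxt = rng.choices(range(len(M)), weights=true_P[state][k])[0]
            M[state][k][nxt] += 1  # Bayesian modification of the prior
            state = nxt
    return history


# Invented example: alternative 0 pays 1 per transition, alternative 1 pays 0.
M = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
R = [[[1.0, 1.0], [0.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]
true_P = [[[0.7, 0.3], [0.5, 0.5]], [[0.4, 0.6], [0.5, 0.5]]]
history = run_heuristic(M, R, true_P, periods=3)
```

Only the alternatives that appear in the chosen policy ever receive new
observations, which is the weakness discussed in the paragraphs that follow.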

Since this heuristic seems the one that would naturally be used
in the absence of some more sophisticated analysis, it is worthwhile to
examine it here. In the first place, the trade-off between immediate
gain and information does not exist in this heuristic. There is no
mechanism which explicitly forces unexplored policies to be observed in early
stages. Therefore, if it should happen that there is some very good policy
which a priori seemed quite bad, it is entirely possible that this heuristic
will never provide the information needed to recognize the policy as being
better than originally thought.

On  the  other  hand,  if  a  policy  looks  very  good  a  priori,  and  happens 
to  be  not  so  good  after  all,  the  heuristic  will  quickly  reveal  this.  Indeed, 
since  Bayesian  modification  of  the  prior  is  continually  taking  place,  this 
not-so-good  policy  will  soon  become  only  second  best  in  the  matrix  of 
expected  values. 

Thus, this first heuristic provides one kind of information, but not
another. If a good policy looks bad, we may never find this out. But if
a bad policy looks good, this is discovered quickly. In practice, even the
first kind of information may be obtained; the original best becomes only
second best and experimentation begins with an unexplored policy, which
may begin to look better. Moreover, in exploring a policy, say [2, 2, 1],
we are also indirectly exploring policies [2, 1, 1], [1, 2, 2], etc.

We expect then, that when the true transition probabilities of the
process are likely values of the prior distribution, this heuristic method
should perform very well. However, when the true transition probabilities
are in fact far from the prior expectations, then the initial policy may
well be a poor one, and may even fail to generate the information very
quickly to indicate that in fact it is an inferior policy.

This last surmise was verified by deliberately choosing an unlikely
sample point from a prior distribution, so that the best policy of the actual
process looked quite poor, a priori. The process and prior are displayed
as Run 1 in Appendix B. Note that in 15 iterations of 50 observations each,
the true optimum was never explored or discovered; rather, the same
apparent optimum was chosen every time.

What is needed is an exhaustive verification of the first assertion,
that in an expected value sense, this heuristic will provide very satisfactory
results. Simulation with a large number of sample points from a very
large number of priors would be needed to establish the assertion
quantitatively.

As an example of the type of simulation needed, and to get some clues
as to probable outcomes, we constructed Run 10 (Appendix B 1.2). Here we
utilized a prior and rewards drawn from random number tables, and
simulated for five sample points. The results were surprisingly good: the
optimal policy of the expected process was almost always the optimal policy
of the sample process, or so close to it that we are earning 96+% of what
we could have earned with perfect information.
This figure seems to be quite satisfactory, especially as it improves with
time as our knowledge becomes more complete.

More simulation of this sort should be carried out. But we turned
our attention toward heuristics which would meet the challenge of extreme
points from the prior.

6.3  A  trade-off  between  immediate  reward  and  information 


We will now establish measures for immediate reward and for
information, and we will then explain our proposed trade-off heuristic. It
has already been noted that the $q_i^k$ can be interpreted as immediate
expected rewards. And a rough and ready estimate of information inherent in an
alternative is $N_i^k = \sum_j m_{ij}^k$; the larger $N_i^k$, the more
information we have about the alternative.

We now define the quantity

$$w_i^k = (q_i^k - q_i^{\min}) \, \alpha / N_i^k .$$

Here α is the trade-off constant. The so defined $w_i^k$ is large when
either 1) the relative immediate reward is great, or 2) the current
information is scant. Thus, our heuristic is to choose a policy by maximizing
$w_i^k$ in each state. We then observe under that policy for a number of
transitions NOBS which is proportional to the smallest of the $w_i^k$ in the
policy. That is, if the relative immediate rewards are all very great, or
the current information in each state is very small, we observe for a long
time: NOBS = β · $w_{\min}$.

This heuristic is displayed as MAIN 2 in Appendix A.
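
The selection rule can be sketched as below. This is not the MAIN 2 program
of Appendix A; it assumes that the reference reward is the smallest immediate
reward among the alternatives of the same state, and the rewards,
information counts and constants are invented.

```python
def tradeoff_policy(q, N, alpha, beta):
    """Pick the alternative maximizing w in each state; return policy, NOBS.

    q[i][k] is the immediate expected reward and N[i][k] the information
    count of alternative k in state i.
    """
    policy, chosen_w = [], []
    for q_i, N_i in zip(q, N):
        q_min = min(q_i)  # assumed meaning of the minimum reward in state i
        w_i = [(q_ik - q_min) * alpha / N_ik for q_ik, N_ik in zip(q_i, N_i)]
        k = max(range(len(w_i)), key=w_i.__getitem__)
        policy.append(k)
        chosen_w.append(w_i[k])
    nobs = int(round(beta * min(chosen_w)))
    return tuple(policy), nobs


# Invented data: in state 1 the poorly observed alternative 2 wins.
q = [[4.0, 10.0], [2.0, 6.0, 5.0]]
N = [[10, 20], [5, 40, 2]]
policy, nobs = tradeoff_policy(q, N, alpha=10.0, beta=5.0)
```

In state 1 the alternative with reward 5 and only 2 observations beats the
alternative with reward 6 and 40 observations, which shows how α trades
reward against information.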


Our experience with this heuristic indicates that the parameters α
and β are quite problem-dependent, and that the policies selected are very
sensitive to changes in either parameter. For a solution to our sample
problem using this heuristic see Runs 2, 3, and 4 in Appendix B.


6.4  Weighted immediate reward technique

In this heuristic, we weight each $q_i^k$ by a factor $c_i^k$ before
applying the policy iteration algorithm for finding the optimal policy. The
weighting factor is similar to the one used in the previous section. Now:

$$c_i^k = \left( \alpha / N_i^k \right) + 1 .$$

Note that as the number of observed transitions becomes very large,
the weighting factor $c_i^k$ approaches 1, so eventually we converge to the
optimal policy of the expected process.


Having found the optimal policy by this technique, we observe under
the determined policy for a number of observations NOBS calculated to
bring the $c_i^A$ down as low as the minimum such value in the state. This
is done to minimize the chances of simply re-determining the same policy.
To be explicit, NOBS for the policy A is chosen to be $\sum_i x_i$, where
the $x_i$ satisfy:

$$\min_k c_i^k = \frac{\alpha}{N_i^A + x_i} + 1 \qquad \text{for all } i$$

It may be that $x_i = 0$ for some i, which means that the $c_i^A$
associated with the optimal policy is already the lowest in the state. If a
given alternative appears in the optimal policy in spite of having the
lowest weighting factor, then it must be a very good policy indeed. So, in
this case we set $x_i = \beta \cdot N_i^A$. That is, if this good policy has
a lot of information to back it up, we observe for a long time.

Runs 5-9, listed in detail in Appendix B, were made using various
values for $\alpha$ and $\beta$. In all cases, we eventually converge on the
expected optimum, since the weighting factors approach 1. But hopefully,
this heuristic will also handle extreme points from the prior. The problem
then becomes one of speed of convergence to the optimum.
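The weighting factor and the rule for NOBS described in this section can be sketched as follows. This is a hedged illustration rather than the MAIN 3 program. It assumes the condition $\min_k c_i^k = \alpha/(N_i^A + x_i) + 1$, which rearranges to $x_i = \max_k N_i^k - N_i^A$ (the constant $\alpha$ cancels). The function names, the list layout `N[i][k]`, and the rounding of NOBS are hypothetical.

```python
def weighting_factors(N, alpha):
    """c[i][k] = alpha / N[i][k] + 1 for every alternative k in every state i."""
    return [[alpha / Nk + 1.0 for Nk in N_i] for N_i in N]


def observation_count(N, policy, beta):
    """NOBS = sum over states of x_i.

    x_i is the number of extra observations that brings the chosen
    alternative's weighting factor down to the smallest factor in its state,
    i.e. x_i = max_k N[i][k] - N[i][policy[i]].  When x_i is already zero
    the text's fallback x_i = beta * N[i][policy[i]] is used.
    """
    total = 0.0
    for N_i, a in zip(N, policy):
        x_i = max(N_i) - N_i[a]
        if x_i == 0:
            x_i = beta * N_i[a]
        total += x_i
    return round(total)  # rounding to an integer is an assumption
```

For `N = [[4.0, 8.0], [6.0, 3.0]]`, policy `[0, 0]` and `beta = 0.5`, the first state needs 4 further observations and the second falls back to 0.5 x 6 = 3, so `observation_count` returns 7.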


To get a measure of speed of convergence, we define an efficiency
by

$$\mathrm{EFF} = \frac{\text{Expected actual earnings}}{\text{Expected optimum earnings}}$$

Plots of efficiency versus accumulated number of observations are
given in Appendix B.3 for different values of $\alpha$ and $\beta$.

It  can  be  seen  from  these  graphs  that  convergence  is  not  terribly 
rapid.  If  a  small  discount  factor  were  operative,  this  convergence  could 
be  quite  unsatisfactory. 

6.5  Weighted variance technique

This  heuristic  is  a  direct  attempt  to  measure  the  trade-off  between 
immediate  gains  and  information.  As  our  measure  of  information,  we  turn 
to  an  analysis  of  variance. 

We first look for an expression for the expected decrease in the
variance of the marginal distribution of a $p_{ij}^k$ if we make one
observation of the process; let us denote this quantity by $d_{ij}^k$.
This quantity gives some idea about how much will be learned about a
particular transition probability if we observe one transition. But some
transitions are more important than others; namely, those with high
rewards. (This is only a first approximation, as those which put us in
states with high expected returns are also important.) Thus, we weight
each $d_{ij}^k$ by the corresponding $r_{ij}^k$ to obtain a row sum
$s_i^k$. This we call the "total weighted expected change in variance,"
and assert that it is a dollar measure of information to be gained.

Next, we ask how we can measure the "cost of obtaining information."
Suppose we want to observe a particular alternative k in state i. Let O
denote the optimal policy for the system, and A denote the best policy
which uses alternative k in state i. Then $g^O - g^A$ provides a measure
for the expected loss from experimentation.

Then we can state the working of this heuristic: for our present
system state we find the alternative which maximizes $s_i^k$; that is, the
alternative with the greatest expected information gain. We then find the
cost of experimenting by computing the approximate $g^O - g^A$, and compare
a weighted value of $s_i^k$ to this difference to decide whether to use the
optimum alternative of the expected process, or whether to experiment
to gain information with alternative k.

We have now to define the quantity $d_{ij}^k$, that is, the expected
decrease in variance from one observation. We first recall the Bayesian
theorem that

mean of posterior variance + variance of posterior mean = prior variance

or

prior variance - mean of posterior variance = variance of posterior mean

The left hand side of this equation is precisely what we mean by
$d_{ij}^k$, so that

$$d_{ij}^k = \text{variance of posterior mean}$$

This is a general result, true for any distribution and for any amount of
sampling. Let us consider the case at hand, a multidimensional beta and
one sample. Then the posterior mean is:

$$\frac{m_{ij}^k + r}{N_i^k + 1}$$

where r has a beta-binomial distribution. The required variance is:

$$d_{ij}^k = \frac{m_{ij}^k \left( N_i^k - m_{ij}^k \right)}{\left( N_i^k \right)^2 \left( N_i^k + 1 \right)^2}$$

Thus, $s_i^k$ is defined by:

$$s_i^k = \frac{1}{\left( N_i^k \right)^2 \left( N_i^k + 1 \right)^2} \sum_j m_{ij}^k \left( N_i^k - m_{ij}^k \right) r_{ij}^k$$
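The variance identity used in this section can be checked with exact arithmetic. The sketch below is illustrative only: the function names are hypothetical, the prior parameters are assumed to be integers, and a single observation is assumed to add one count to exactly one parameter of the row. It evaluates $d = m(N - m) / (N^2 (N + 1)^2)$, confirms that this equals the prior variance minus the mean posterior variance, and forms the reward-weighted row sum.

```python
from fractions import Fraction


def beta_variance(m, N):
    """Variance of the marginal beta distribution with parameters (m, N - m)."""
    return Fraction(m * (N - m), N ** 2 * (N + 1))


def expected_variance_decrease(m, N):
    """d = variance of the posterior mean after one observation."""
    return Fraction(m * (N - m), N ** 2 * (N + 1) ** 2)


def prior_minus_mean_posterior_variance(m, N):
    """Left-hand side of the identity, computed by enumerating the outcome.

    With probability m/N the transition is observed (m and N both grow by 1);
    otherwise only N grows by 1.
    """
    p = Fraction(m, N)
    mean_posterior = (p * beta_variance(m + 1, N + 1)
                      + (1 - p) * beta_variance(m, N + 1))
    return beta_variance(m, N) - mean_posterior


def weighted_information(m_row, r_row):
    """s = sum over j of d_j * r_j for one alternative (one row of the prior)."""
    N = sum(m_row)
    return sum(expected_variance_decrease(m, N) * r
               for m, r in zip(m_row, r_row))
```

For m = 2 and N = 5 both routes give 1/150, and a row with parameters (2, 3) and rewards (10, -4) has a weighted sum of 1/25.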


Again we mention that after each transition, this calculation must be
made for the state currently occupied to determine whether to use the
optimal or experimental alternative.

It should be noted that this is also a convergent scheme in that
eventually all of the variances go to zero, so we always follow the "optimal"
rather than the experimental alternative.

6.6  Suggestions for future heuristics

Unfortunately, most of the experimentation with the heuristics
discussed in this chapter took place early in the project, before we had
a clear idea of just what information we wanted to get from our simulations.
From our present point of view, we can suggest several further experiments
which should be performed:

1.)  An "expected value" evaluation of the first heuristic (follow optimal
policy for expected process) must be developed. This will involve applying
the heuristic to many random draws from the prior to measure its
effectiveness.

2.)  We feel that heuristics 2 and 3 are inferior because their
parameters are too problem dependent. However, it may be that if a class
of problems is to be solved, it would pay to get values for the parameters
for some members of the class, and use these values for all members
of the class. When appropriate parameters are used, convergence is
very good, but we are in doubt as to whether that is a prior or posterior
fact: did we perhaps just find parameters which happen to give good
results given both our prior and the sample points?

6.6.1  A possible heuristic

The investigation on the distribution of the average long-run gain
described in the last chapter has given rise to a heuristic which we have
not yet tested, but which looks like it might be effective.

Recall that the limitation in following the optimal policy of the
expected process was that we considered only immediate rewards, and
not the possibility of gaining information. We consider now using a
modification of policy iteration to find, instead of only the optimal policy
of the expected process, the 10 best or so. We could then use simulation
or the predictive results of the last chapter to estimate both the mean and
variance of the gain for each of these policies. This would give us measures
of both immediate gain and uncertainty, so a trade-off could be established
between them. There is much to be learned about a high-variance policy,
and much to be gained from a high-mean policy.

It might also be wise to check for excessively high variances before
entering the above routine, and force tests to eliminate any such.
As is demonstrated in Appendix C, it is impossible to modify the
policy iteration method to produce the 10 best policies; but Appendix C
continues to present an approximation procedure for finding N of the
best policies in an N state process.

There  are  several  experiments  which  we  can  suggest  as  useful 
for  evaluating  this  heuristic: 

1.)  By simulation, explore how much actual overlap there is in the
range of g for different "good" policies. That is, would a "second best"
policy have a good chance of having a large percentage of its values of
g being above the g of the best policy?

2.)  Check by simulation whether, for two policies A and B, the
relationship between $E(g^A)$ and $E(g^B)$ is indeed similar to the
relationship between $g^A$ and $g^B$. If this is not the case, knowledge
of the "10 best" policies of the expected process does not tell us anything
about the "10 best" of the primary process. Our previous experimentation
with distribution of gain suggests that the required relationship will hold.

3.)  Check, for low $g$ and high $\sigma_g^2$, whether $g$ is equally
likely to increase or decrease. If it should happen to decrease more often
than increase (which could be a structural property of this type of
process), experimentation with high-variance policies might not be as good
an idea as it appears.

On  heuristic  grounds,  this  technique  seems  to  be  sound.  We 
conclude  with  a  summary  of  the  technique,  and  a  suggestion  that  it  be 
evaluated: 

1.)  Find N best policies.

2.)  Determine mean and variance of the gain for each policy.

3.)  Choose a policy by maximizing a weighted sum of mean and
variance.

4.)  Use that policy for a while; then up-date prior and repeat.

CHAPTER  VII


CONCLUSION 

The most basic conclusion which emerges from our research
and experimentation is that the expected process gives a surprisingly
good picture of the primary process. Even for relatively small values
of the parameters $N_i^k$, the basic statistics of the expected process are
good approximations to the expected values of the corresponding
statistics for the primary process.

This result lends experimental justification to the most natural
approach to the problem. In most current applications, the transition
probabilities are not known with certainty anyhow; and some sort of
"best estimates" are used. The Bayesian analysis only suggests a
formal way of providing these best estimates, and for up-dating them
in time.

Still,  far  more  simulation  experience  is  needed  before  these 
conclusions  can  be  stated  with  certainty.  It  may  still  be  that  extreme 
points  from  the  prior  will  present  enough  difficulty  so  that  the  expected 
process  approach  will  not  be  good  in  an  expected  value  sense.  If 
further  research  should  indicate  that  this  is  the  case,  then  the  heuris¬ 
tics  dealing  with  variances  as  well  as  means  would  become  more 
relevant,  and  would  have  to  be  carefully  evaluated. 

It  appears  at  present  that  an  exact  solution  to  the  problem  is  not 
feasible.  Some  sort  of  approximation  procedure  will  be  necessary  to 
handle  any  good  sized  problem,  and  the  remaining  question  is  only: 
which  approximation  is  best? 

Finally, there is the matter of relative cost of obtaining solutions.
In this regard, it is significant that the expected process technique is so
easy to apply. Essentially, it involves only policy iteration, which
Dr. R. A. Howard has shown to be quite practical even for problems
with 50 states and 50 alternatives in each state.

Thus, it seems that heuristic methods provide a feasible and
useful means of dealing with Markovian decision processes with
uncertain transition probabilities.

APPENDIX  A


DETAILS  AND  LISTINGS  OF  COMPUTER  PROGRAMS

A.1  General Description

Our  project  called  for  the  testing  of  many  different  heuristics, 
and  to  facilitate  this  task,  we  decided  to  write  our  computer  programs 
in  blocks,  using  the  subprogram  feature  of  7090  FORTRAN.  Each  block 
was  designed  as  an  independent  unit,  fulfilling  a  particular  task.  Then, 
design  of  a  new  heuristic  merely  required  writing  a  short  program  to 
couple  these  blocks  appropriately.  Briefly,  the  subprograms  are: 

IPUT:         Reads in an entire process (prior parameters and
              rewards).

VALUE(L):     A programmed version of R. A. Howard's policy
              iteration algorithm. Used to find optimal policies
              and gains.

ITER(L):      Used in conjunction with VALUE.

OBS(NOBS):    Simulates NOBS transitions of the process and
              up-dates the prior accordingly.

ISIM(IPRES):  Called by OBS, this subprogram merely simulates
              a single transition and reports the outcome.

OPUT(I, N):   Causes the printing of results.

PRIOR:        Restores the original prior in place of a posterior.
              Reinitializes between runs.

GEN:          Draws a random sample point from the prior
              distribution.

We now turn to a more detailed account of the subprograms and
the MAINS used to tie them together.

A.2  Subroutine IPUT

The source statement CALL IPUT causes the following cards to
be read in:

1.)  NS, or number of states, in format I2.

2.)  NA(I), or the number of alternatives in each state, in format
     I2, 3X, I2, 3X, etc.

3.)  Cards with the prior distribution. Seven entries per card, 10
     columns per entry. First the P(I, J), then $N_i^k$ (all in floating
     point).

4.)  Cards with the rewards. Seven entries per card, 10 columns
     per entry.

The program stores the values in the proper locations, computes
the $q_i^k$ values, and checks that all probabilities add to one. If an
error is found, the message

PROB NOT = 1 IN ROW

prints, and the program terminates. The prior is stored in its working
matrix, but also in OPR for reinitialization.

A.3  Function VALUE (L)

This function is called by a source statement of the form

GAIN = VALUE (L)

If L = 1, it computes the relative values $v_i$ for the actual probabilities
under the current policy.

L = 2, it computes the relative values $v_i$ for the actual probabilities
under the optimal policy.

L = 3, it computes the relative values $v_i$ for the prior probabilities
under the current policy.

L = 4, it computes the relative values $v_i$ for the prior probabilities
under the optimal policy.

L = 5, it computes the $q_i^k$ values for both actual and prior
probabilities.

GAIN, which is the long run expected gain associated with the
computed values $v_i$, is returned as the functional value. If an optimal
policy is computed, it is left in the K(I) vector of common storage.
The values are always left in the V(I) vector of common storage.

Thus, for example, the statement GAIN = VALUE (4) would
cause the optimal policy of the expected process, along with its values
and gain, to be computed.

A.4  Function ITER (L)

This subroutine is called by the VALUE subprogram, and
corresponds to the "policy improvement" phase of R. A. Howard's
algorithm.

If L = 1, it improves the policy assuming actual probabilities.

L = 2, same as L = 1.

L = 3, it improves the policy assuming prior probabilities.

L = 4, same as L = 3.

L = 5, it chooses an initial policy for the actual probabilities by
maximizing $q_i^k$ in each state.

L = 6, it chooses an initial policy for the prior probabilities by
maximizing $q_i^k$ in each state.

It is called by a source statement of the form

IOPT = ITER(L),

and IOPT is returned as 1 if the policy did not change (optimum found), and
as 2 if the policy did change (optimum not yet found).
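The division of labour between VALUE (value determination) and ITER (policy improvement) can be illustrated with a short Python sketch of policy iteration for the average-gain criterion. It is not a translation of the FORTRAN listings: the function names, the nested-list layout `P[i][k][j]` and `q[i][k]`, and the small Gaussian elimination routine are assumptions made for this example. As in ITER, the current alternative is kept on a tie, and as in VALUE, the last relative value is fixed at zero.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(P, q, policy):
    """Value determination: g + v_i = q_i + sum_j p_ij v_j, with v_last = 0."""
    n = len(policy)
    A = [[(1.0 if j == i else 0.0) - P[i][k][j] for j in range(n - 1)] + [1.0]
         for i, k in enumerate(policy)]
    b = [q[i][k] for i, k in enumerate(policy)]
    x = solve(A, b)            # unknowns: v_0 .. v_{n-2}, then the gain g
    return x[-1], x[:-1] + [0.0]


def improve(P, q, policy, v):
    """Policy improvement: maximize q_i^k + sum_j p_ij^k v_j in each state."""
    new_policy = []
    for i, k_old in enumerate(policy):
        tests = [q[i][k] + sum(p * vj for p, vj in zip(P[i][k], v))
                 for k in range(len(q[i]))]
        k_best = max(range(len(tests)), key=tests.__getitem__)
        if tests[k_old] >= tests[k_best] - 1e-12:
            k_best = k_old     # keep the current alternative on a tie
        new_policy.append(k_best)
    return new_policy


def policy_iteration(P, q):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = [max(range(len(qi)), key=qi.__getitem__) for qi in q]
    while True:
        gain, v = evaluate(P, q, policy)
        new_policy = improve(P, q, policy, v)
        if new_policy == policy:
            return gain, v, policy
        policy = new_policy
```

The starting policy here maximizes the immediate expected reward in each state, which is the role the text assigns to ITER with L = 5 or 6.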

A.5  Subroutine OBS (NOBS)

This subprogram is called by a source statement of the form

CALL OBS (NOBS)

It simply causes NOBS transitions to be simulated using the actual
probabilities, and the frequencies to be tallied. After the observations
are completed, the subroutine performs a Bayesian up-dating of the
prior before returning control to the main program.
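The Bayesian up-dating step can be sketched as below. This is an illustration under stated assumptions, not the OBS routine itself: the function names are hypothetical, one row of the prior is represented by its mean probabilities `p` together with its strength `N`, and `counts[j]` is the number of observed transitions to state j under the alternative in use.

```python
def bayes_update(p, N, counts):
    """Return the posterior mean probabilities and the new strength.

    Each mean p_j is a pseudo-count m_j = p_j * N divided by N, so adding
    the observed counts gives (p_j * N + counts_j) / (N + total observed).
    """
    n_new = N + sum(counts)
    p_new = [(pj * N + wj) / n_new for pj, wj in zip(p, counts)]
    return p_new, n_new


def expected_reward(p, r):
    """Immediate expected reward q = sum_j p_j * r_j for one alternative."""
    return sum(pj * rj for pj, rj in zip(p, r))
```

Starting from `p = [0.5, 0.5]` with `N = 4` and observing 3 transitions to the first state and 1 to the second gives the posterior means 0.625 and 0.375 with strength 8.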

A.6  Function ISIM (IPRES)

This function is called by the OBS subprogram, and it is this
subroutine which actually does the simulation of transitions. A random
number is drawn from a rectangular (0, 1) distribution, and this random
number in conjunction with the transition probabilities determines a
transition. The new state is reported back to OBS, and the latter program
records it, etc.
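Drawing a transition from a rectangular random number amounts to subtracting the transition probabilities one at a time until the remainder is no longer positive. A minimal Python sketch follows; the function name and the optional argument `u` (which lets a caller supply the random number for reproducibility) are assumptions of this example, and states are numbered from 0 here rather than from 1.

```python
import random


def simulate_transition(probs, u=None):
    """Return the index of the next state given one row of probabilities."""
    if u is None:
        u = random.random()    # rectangular on [0, 1)
    for j, p in enumerate(probs):
        u -= p
        if u <= 0:
            return j
    # Guard against rounding error when the probabilities sum to
    # slightly less than one.
    return len(probs) - 1
```

With the row `[0.2, 0.5, 0.3]`, the values u = 0.1, 0.25 and 0.95 give states 0, 1 and 2 respectively.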

A.7  Subroutine PRIOR

This subroutine simply reinitializes the prior matrix, and calls
GEN to supply a new sample process. PRIOR is the recycle point of all
the MAINS.

A.8  Subroutine OPUT (I, N)

This subroutine performs several different functions, since many
types of output are needed. Ostensibly, I is the iteration number, and N
is the number of observations in the iteration. If both I and N are
non-zero, normal output results. Normal output consists of the
following:

1.)  The iteration number I.

2.)  The number of observations N.

3.)  The estimated gain, or gain of process using prior probabilities
     and current policy. The program assumes that this quantity has been
     computed and is in the common storage location GAIN.

4.)  The actual gain of the current policy. This is computed by the
     OPUT subroutine from the actual probabilities before printing.

5.)  Accumulated number of observations. A quantity computed by
     OPUT.

6.)  Efficiency. This too is computed by OPUT, and is defined by:

$$\mathrm{EFF} = \frac{\text{previous accumulated profit} + (\text{no. obs.}) \times (\text{actual gain})}{(\text{optimal actual gain of process}) \times (\text{accumulated no. of obs.})}$$

     where the numerator becomes the new previous accumulated profit. The
     optimum actual gain of the process is known by OPUT (see below).

7.)  The current policy being followed. This must be in the K(I) vector
     when OPUT is called.
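The running efficiency in item 6 can be sketched as a small accumulator. The class name and its attributes are hypothetical and chosen only for this illustration; the arithmetic follows the definition given in the list above, with the numerator carried forward as the new accumulated profit.

```python
class EfficiencyTracker:
    """Accumulates profit and observations across iterations of one run."""

    def __init__(self, optimal_gain):
        self.optimal_gain = optimal_gain  # optimal actual gain of the process
        self.profit = 0.0                 # previous accumulated profit
        self.total_obs = 0                # accumulated number of observations

    def record(self, nobs, actual_gain):
        """Add one iteration and return the efficiency to date."""
        self.profit += nobs * actual_gain
        self.total_obs += nobs
        return self.profit / (self.optimal_gain * self.total_obs)
```

For a process whose optimal gain is 2.0, ten observations under a policy with actual gain 1.0 give an efficiency of 0.5; thirty further observations under the optimal policy raise it to 70/80 = 0.875.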

The source statement CALL OPUT (I, N) causes a single line with this
information to be printed, if both I and N are positive. But, I can also
serve two control functions:

I = 0:

This causes the process information to be printed. The prior,
reward, and actual probabilities matrices are all printed, and control
is immediately returned to the main program.

I < 0:

This causes initialization preparatory to a new run. The actual
optimum gain is computed and stored for later use in computing efficiency.
Then a header is printed to identify later output. Finally, a line
identified as iteration 0 is printed in which: estimated gain = actual
gain = optimum actual gain of the process. The policy indicated is the
optimal policy, and all other quantities are zero. The output from this
section looks like:

RUN NO.

IT.  NO OBS  EST GAIN  ACT GAIN  ACCUM OBS  EFF  POLICY
 0     0     (actual optimum gain)    0      0   (actual optimum policy)

There is a special OPUT program for MAIN 4 which gives slightly
different output, and computes EFF on the basis of actually observed
rewards.

A.9  Subroutine GEN

This subroutine is called by a source statement of the form

CALL GEN

The statement causes a sample process to be drawn from the prior and
placed in the "actual probabilities" locations. The $q_i^k$ values are
computed by GEN, and control is returned to the main program.
The sampling is done according to the method outlined in section
2.4. As mentioned there, truncation is used to assure that all parameters
are integral, or else simulation would not be possible.

A.10  MAIN 1

This is a programmed version of the first heuristic described in
Chapter 6: follow optimal policy for expected process. It needs a
parameter card with the following items:

1.)  IPRES, present state of system (initial), in columns 1-2.

2.)  NOBS, number of observations between recomputations of policy,
     in columns 6-8.

3.)  NOITS, number of iterations for each sample process, in columns
     12-14.

4.)  IRUN, run number, in columns 18-19.

The process information should follow this card (see IPUT).

A process is generated and simulated for NOITS iterations of
NOBS observations each. Then a new process is generated, and so forth.
There is no provision for termination of the program; we just run until
running out of time.

A flow chart follows.

A.11  MAIN 2

This is a programmed version of the second heuristic described
in Chapter 6, a trade-off between immediate rewards and information.
The following parameter card is expected:

1.)  IPRES, initial state of system, in columns 1-2.

2.)  IRUN, the run number, in columns 6-7.

3.)  NOITS, the desired number of iterations, in columns 11-12.

4.)  ALPHA (see Section 6.3), in columns 16-20.

5.)  BETA (see Section 6.3), in columns 24-28.

The process information follows this card (see IPUT).

A process is generated and simulated for NOITS iterations.
Then a new process is generated, etc. No termination is provided for;
the program runs until stopped.

A flow chart follows.

A.12  MAIN 3

This is a programmed version of the third heuristic described in
Chapter 6, the weighted immediate reward technique. The parameter card
contains:

1.)  IRUN, the run number, in columns 1-2.

2.)  IPRES, the initial state of the system, in columns 6-7.

3.)  NOITS, the desired number of iterations, in columns 11-12.

4.)  ALPHA (see Section 6.4), in columns 16-20.

5.)  BETA (see Section 6.4), in columns 24-28.

The process information should follow (see IPUT).

A process is generated and simulation for NOITS iterations occurs.
Then a new sample process is generated, etc. No termination is provided
for.

A flow chart follows.

A.13  MAIN 4

This is a programmed version of the fourth heuristic described in
Chapter 6: the weighted variance technique. The parameter card contains:

1.)  IRUN, the run number, in columns 1-3.

2.)  IPRES, the initial state of the system, in columns 7-9.

3.)  NOITS, the desired number of iterations, in columns 13-15.

4.)  CEE (see Section 6.5), in columns 19-28.

The process information follows the parameter card (see IPUT).

A sample process is generated and simulation for NOITS iterations
occurs. Then a new sample process is generated, etc. No termination
is provided for.

A flow chart follows.

A.14  Distribution of gain

This is a programmed version of the algorithm discussed in
Chapter 5. The first data card must be a parameter card:

1.)  IRUN, run number, in columns 1-3.

2.)  MAX, or maximum number of sample points, in columns 7-10.

3.)  MTIM, in columns 14-17.

MTIM is used in conjunction with the MIT clock-interrupt, and
specifies the latest time at which sampling should end and output begin,
measured from the termination time of the program. If MTIM is set
at less than 200, it will insure that no output is lost by running
overtime, while it will not interfere with normal running if overtime would
not occur.

The program reads only one parameter card, but after each
output it returns to read the information for another process. So, by
stacking the information for several processes behind the parameter
card, any number can be investigated.

NOTE: Since gain refers to a policy, the process supplied
must have only one alternative per state.

A flow chart follows.

A.15  Solution of the special two-state problem

This is a programmed version of the solution routine described
for a two-state process in Chapter 3. It requires three data cards.

1.)  BETA and NMAX, the value to be taken as infinity.

2.)  The known probabilities P(1,1), P(1,2), P(2,1), P(2,2).

3.)  The rewards for the known and unknown alternatives:
     R(1,1), R(1,2), R(2,1), R(2,2), R(3,1), R(3,2).

All entries are 10 columns each, floating point, except NMAX which is a
5 position integer.

The program prints a column of values for each $n < 50$, $0 \le r \le n$.
When all columns have been printed, a summary chart of the decision
boundary is printed.

PROGRAM LISTINGS


      SUBROUTINE IPUT
C     SUBROUTINE FOR DATA INPUT
      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      READ 12, NS
      READ 12, (NA(I), I=1,NS)
C     CONVERT ALTERNATIVE COUNTS IN NA TO CUMULATIVE ROW OFFSETS
      NUM = NA(1)
      NA(1) = 0
      DO 5 I=1,NS
      NX = NA(I+1)
      NA(I+1) = NUM
    5 NUM = NUM + NX
      NL = NS + 1
      N = NA(NL)
      DO 6 I=1,N
    6 READ 11, (P(I,J), J=1,NL)
      DO 7 I=1,N
    7 READ 11, (R(I,J), J=1,NS)
C     SAVE THE ORIGINAL PRIOR FOR REINITIALIZATION
      DO 8 I=1,N
      DO 8 J=1,NL
    8 OPR(I,J)=P(I,J)
      DUM = VALUE(5)
   11 FORMAT (7F10.6)
   12 FORMAT (I2,14(3X,I2))
C     CHECK THAT EACH ROW OF PROBABILITIES SUMS TO ONE
      DO 15 I=1,N
      SUM = 0.0
      SUM2 = 0.0
      DO 14 J=1,NS
   14 SUM = SUM + P(I,J)
      ERF = .0001
      IF (ABSF(1.-SUM)-ERF) 15,18,18
   15 CONTINUE
      RETURN
   18 PRINT 19, I
   19 FORMAT (21H1PROB NOT = 1 IN ROW ,I2)
      CALL EXIT
      END

      FUNCTION VALUE(L)
C     VALUE SUBPROGRAM 10-16-63
      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      IOPT = 2
      NSM=NS-1
    7 GO TO (1,1,2,2,3),L
    1 DO 4 I=1,NS
      LL=NA(I)+K(I)
      WORK2(I)=AP(LL,NS+1)
    5 DO 4 J=1,NSM
    4 A(I,J) = -AP(LL,J)
    6 DO 8 I=1,NS
      A(I,I) = A(I,I) + 1.
    8 A(I,NS) = 1.
C     CALL LINEAR EQUATION SOLUTION ROUTINE
      SCALE = 1.
      M = XSIMEQF(15,NS,1,A,WORK2,SCALE,V)
    9 GO TO (10,11,11),M
   10 DO 12 I=1,NSM
   12 V(I) = A(I,1)
      V(NS) = 0.
      VALUE = A(NS,1)
   13 GO TO (14,15,14,15),L
C     CALL ITERATION ROUTINE
   15 IOPT = ITER(L)
   16 GO TO (14,7),IOPT
C     IF L = 3 OR 4, GO HERE
    2 DO 17 I=1,NS
      LL=NA(I)+K(I)
      WORK2(I) = P(LL,NS+2)
   18 DO 17 J=1,NSM
   17 A(I,J) = -P(LL,J)
      GO TO 6
C     IF L=5, COME HERE
    3 NSM = NA(NS+1)
   19 DO 20 I=1,NSM
      AP(I,NS+1) = 0.
      P(I,NS+2) = 0.
   21 DO 20 J=1,NS
      AP(I,NS+1) = AP(I,NS+1) + AP(I,J)*R(I,J)
   20 P(I,NS+2) = P(I,NS+2) + P(I,J)*R(I,J)
      VALUE = 0.
      FREQUENCY 7(1,1,1,7,1),1(15),5(15),6(15),9(1,0,0),10(15)
      FREQUENCY 13(1,1,1,7),16(1,5),2(15),18(15),19(15),21(15)
   14 RETURN
C     ERROR EXIT
   11 PRINT 30
   30 FORMAT (23H NO SOLUTION FOR VALUES)
      CALL EXIT
      END

      FUNCTION ITER(L)
C     ITERATION SUBPROGRAM 10-16-63
      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      ITER = 1
    2 DO 1 I=1,NS
      TEMP = -99999.
      NTEMP = 0
      IMIN = NA(I)+1
      IMAX = NA(I+1)
    3 DO 15 M=IMIN,IMAX
    5 GO TO (6,6,7,7,8,9),L
    6 TEST = AP(M,NS+1)
   10 DO 11 J=1,NS
   11 TEST = TEST + AP(M,J)*V(J)
      GO TO 12
    7 TEST = P(M,NS+2)
   14 DO 13 J=1,NS
   13 TEST = TEST + P(M,J)*V(J)
      GO TO 12
    8 TEST = AP(M,NS+1)
      GO TO 12
    9 TEST = P(M,NS+2)
C     ON A TIE, KEEP THE ALTERNATIVE ALREADY IN THE POLICY
   12 IF (TEST-TEMP) 15,16,17
   16 IF (NTEMP - K(I)) 17,15,17
   17 NTEMP = M-NA(I)
      TEMP = TEST
   15 CONTINUE
   18 IF (NTEMP - K(I)) 19,1,19
   19 ITER = 2
      K(I) = NTEMP
    1 CONTINUE
      RETURN
      FREQUENCY 2(15),3(13),5(1,1,60,60,1,20),10(15),14(15),12(10,0,10)
      FREQUENCY 16(1,1,1),18(1,2,1)
      END

      SUBROUTINE OBS(NOBS)
C     SUBROUTINE FOR OBSERVING THE PROCESS AND FOR
C     UPDATING THE ESTIMATED PROBABILITIES AND THE Q'S
      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      NS1 = NS + 1
      DO 2 I=1,NS
      DO 2 J=1,NS1
    2 W(I,J) = 0.0
C     TALLY TRANSITION FREQUENCIES IN W
      DO 3 I=1,NOBS
      INEXT = ISIM(IPRES)
      W(IPRES,INEXT) = W(IPRES,INEXT) + 1.0
      W(IPRES,NS1) = W(IPRES,NS1) + 1.0
    3 IPRES = INEXT
      DO 6 I=1,NS
      IF (W(I,NS1)) 1,6,4
C     COMPUTE NEW Q(I)
    4 IP = K(I) + NA(I)
      OLDEN = P(IP,NS1)
      P(IP,NS+1) = OLDEN + W(I,NS1)
      P(IP,NS+2) = 0.0
      DO 5 J=1,NS
      P(IP,J) = (P(IP,J)*OLDEN + W(I,J))/P(IP,NS+1)
    5 P(IP,NS+2) = P(IP,NS+2) + P(IP,J)*R(IP,J)
    6 CONTINUE
      RETURN
    1 PRINT 501
  501 FORMAT (10H ERROR 501)
      CALL EXIT
END 
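Statements 4 and 5 of OBS treat P(IP,NS+1) as an equivalent number of prior observations. Each estimated probability becomes (old estimate times old count, plus observed transitions) divided by the new count, and the expected immediate reward q is then recomputed from the updated row. The Python sketch below repeats that arithmetic for a single row; the function and argument names are hypothetical and do not appear in the original programs.

```python
def update_row(p_row, n_prior, counts, rewards):
    """Update one row of estimated transition probabilities.

    p_row   : current estimates p_ij for one (state, alternative) pair
    n_prior : weight of those estimates, in equivalent observations
    counts  : observed transitions to each next state j
    rewards : transition rewards r_ij for the same row
    Returns (new_p_row, n_post, q).
    """
    n_post = n_prior + sum(counts)
    # Weighted average of the prior mean and the observed frequencies.
    new_p = [(pj * n_prior + cj) / n_post for pj, cj in zip(p_row, counts)]
    # Expected immediate reward under the updated estimates.
    q = sum(pj * rj for pj, rj in zip(new_p, rewards))
    return new_p, n_post, q
```

With no new observations the row is returned unchanged, which corresponds to the branch to statement 6 when W(I,NS1) is zero.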
      FUNCTION ISIM(IPRES)

C     SIMULATE ONE OBSERVATION

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)

      IP = K(IPRES) + NA(IPRES)
      AA = RANNOF(X)
      DO 2 J=1,NS
      AA = AA - AP(IP,J)
      IF (AA) 3,3,2
    3 ISIM = J
      GO TO 5
    2 CONTINUE
      ISIM = NS
    5 RETURN
      END
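ISIM draws the next state by subtracting successive transition probabilities from a uniform random number until the remainder is no longer positive. A short Python sketch of the same idea follows; the function name and the optional argument `u` (which lets a caller supply the uniform draw) are assumptions added for illustration.

```python
import random


def simulate_next_state(row, u=None):
    """Return the 1-based index of the next state.

    row : transition probabilities out of the present state
    u   : uniform draw in [0, 1); taken from random.random() if omitted
    """
    if u is None:
        u = random.random()
    for j, pj in enumerate(row, start=1):
        u -= pj
        if u <= 0.0:
            return j
    # Round-off guard, like the ISIM = NS statement after the loop.
    return len(row)
```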


      SUBROUTINE OPUT(I,N)

C     OPUT 3  ONE LINE OUTPUT
C     I IS THE ITERATION NUMBER
C     N IS THE NUMBER OF OBSERVATIONS IN THE ITERATION

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      DIMENSION HOLD(15)
C
      IF (I) 52,3,2

C     IF I=0, PRINT MATRICES

    3 PRINT 501,IRUN
  501 FORMAT (8H1RUN NO ,I4)

C     PRINT ACTUAL PROBABILITY MATRIX

      PRINT 505
  505 FORMAT (26H1ACTUAL PROBABILITY MATRIX)
      LA = 1
   13 PRINT 506,(L,L=1,10)
  506 FORMAT (6X,10(6X,I2),5X,3HOBS,11X,1HQ)
      DO 9 L=1,NS
      MAX = NA(L+1)-NA(L)
      DO 9 KA=1,MAX
      IP = NA(L)+KA
      GO TO (6,7,8),LA
    6 PRINT 507,L,KA,(AP(IP,J),J=1,NS)
  507 FORMAT (1H0,I2,1H,,I2,2X,10F8.4/(8X,10F8.4))
      PRINT 531, AP(IP,NS+1)
  531 FORMAT (1H+,99X,F10.4)
      GO TO 9
    7 PRINT 513,L,KA,(R(IP,J),J=1,NS)
  513 FORMAT (1H0,I2,1H,,I2,2X,10F8.2/(8X,10F8.2))
      GO TO 9
    8 PRINT 507,L,KA,(P(IP,J),J=1,NS)
      PRINT 520, P(IP,NS+1),P(IP,NS+2)
  520 FORMAT (1H+,88X,F10.0,1X,F10.4)
    9 CONTINUE
      GO TO (10,11,12),LA

C     PRINT REWARD MATRIX

   10 LA = 2
      PRINT 508
  508 FORMAT (14H1REWARD MATRIX)
      GO TO 13
   11 LA = 3
      PRINT 514
  514 FORMAT (29H1ESTIMATED PROBABILITY MATRIX)
      GO TO 13
   12 RETURN

C     COMPUTE TRUE OPTIMUM POLICY.

   52 MAX = ITER(5)
      OGAIN = VALUE(2)

C     PRINT HEADER

      PRINT 530, IRUN
  530 FORMAT (8H1RUN NO ,I4/51H  IT  NO OBS   EST GAIN   ACT GAIN ACCUM 
     1OBS    EFF,5X,6HPOLICY)
      EGAIN = OGAIN
      AGAIN = OGAIN
      ACCOBS = 0.0
      PROF = 0.0
      EFF = 0.0
      I = 0
   50 PRINT 502,I,N,EGAIN,AGAIN,ACCOBS,EFF,(K(J),J=1,NS)
  502 FORMAT (1X,I3,4X,I4,1X,F10.2,1X,F10.2,1X,F9.0,2X,F6.4,2X,I2,14(
     11H,,I2))
      RETURN

C     IF NOT I=0, COME HERE
C     STORE CURRENT VALUES TO COMPUTE ACTUAL VALUES.

    2 DO 14 KA=1,NS
   14 HOLD(KA) = V(KA)
      AGAIN = VALUE(1)
      EGAIN = GAIN
      DO 16 KA=1,NS
   16 V(KA) = HOLD(KA)
      FNOBS = N
      PROF = PROF + FNOBS*AGAIN
      ACCOBS = ACCOBS+FNOBS
      EFF = PROF/(OGAIN*ACCOBS)
      GO TO 50
      END
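The efficiency printed by this version of OPUT is the accumulated actual gain, weighted by the number of observations in each iteration, divided by what the true optimum policy would have earned over the same number of observations (EFF = PROF/(OGAIN*ACCOBS)). The Python sketch below restates that running ratio; the function name and the list-of-pairs input are assumptions made for illustration.

```python
def running_efficiency(iterations, optimal_gain):
    """Efficiency after each iteration.

    iterations   : sequence of (n_obs, actual_gain) pairs, one per iteration
    optimal_gain : gain of the true optimum policy (OGAIN in the listing)
    Returns one efficiency value per iteration.
    """
    profit = 0.0       # PROF: accumulated observations times actual gain
    total_obs = 0.0    # ACCOBS: accumulated number of observations
    history = []
    for n_obs, actual_gain in iterations:
        profit += n_obs * actual_gain
        total_obs += n_obs
        history.append(profit / (optimal_gain * total_obs))
    return history
```

An efficiency of 1 means every observation so far was taken under a policy whose actual gain equals the optimal gain.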
      SUBROUTINE OPUT(I,N)

C     OPUT 4  FOR USE WITH MAIN 4 ONLY -- EFF IN TERMS OF REWARDS
C     I IS THE ITERATION NUMBER
C     N IS THE NUMBER OF TIMES OPTIMUM WAS USED

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      DIMENSION HOLD(15)
C
      IF (I) 2,3,2

C     IF I=0, PRINT MATRICES

    3 PRINT 501,IRUN
  501 FORMAT (8H1RUN NO ,I4)

C     PRINT ACTUAL PROBABILITY MATRIX

      PRINT 505
  505 FORMAT (26H1ACTUAL PROBABILITY MATRIX)
      LA = 1
   13 PRINT 506,(L,L=1,10)
  506 FORMAT (6X,10(6X,I2),5X,3HOBS,11X,1HQ)
      DO 9 L=1,NS
      MAX = NA(L+1)-NA(L)
      DO 9 KA=1,MAX
      IP = NA(L)+KA
      GO TO (6,7,8),LA
    6 PRINT 507,L,KA,(AP(IP,J),J=1,NS)
  507 FORMAT (1H0,I2,1H,,I2,2X,10F8.4/(8X,10F8.4))
      PRINT 531, AP(IP,NS+1)
  531 FORMAT (1H+,99X,F10.4)
      GO TO 9
    7 PRINT 513,L,KA,(R(IP,J),J=1,NS)
  513 FORMAT (1H0,I2,1H,,I2,2X,10F8.2/(8X,10F8.2))
      GO TO 9
    8 PRINT 507,L,KA,(P(IP,J),J=1,NS)
      PRINT 520, P(IP,NS+1),P(IP,NS+2)
  520 FORMAT (1H+,88X,F10.0,1X,F10.4)
    9 CONTINUE
      GO TO (10,11,12),LA

C     PRINT REWARD MATRIX

   10 LA = 2
      PRINT 508
  508 FORMAT (14H1REWARD MATRIX)
      GO TO 13
   11 LA = 3
      PRINT 514
  514 FORMAT (29H1ESTIMATED PROBABILITY MATRIX)
      GO TO 13

C     COMPUTE TRUE OPTIMUM POLICY.

   12 MAX = ITER(5)
      OGAIN = VALUE(2)

C     PRINT HEADER

      PRINT 530, IRUN
  530 FORMAT (8H1RUN NO ,I4/51H  IT  NO OPT   EST GAIN   ACT GAIN ACCUM 
     1OBS    EFF,5X,6HPOLICY)
      EGAIN = OGAIN
      AGAIN = OGAIN
      ACCOBS = 0.0
      PROF = 0.0
      EFF = 0.0
   50 PRINT 502,I,N,EGAIN,AGAIN,ACCOBS,EFF,(K(J),J=1,NS)
  502 FORMAT (1X,I3,4X,I4,1X,F10.2,1X,F10.2,1X,F9.0,1X,F6.2,2X,I2,14(
     11H,,I2))
      RETURN

C     IF NOT I=0, COME HERE
C     STORE CURRENT VALUES TO COMPUTE ACTUAL VALUES.

    2 PROFIT = WORK2(1)
      AGAIN = VALUE(1)
      PROF = PROF + PROFIT
      ACCOBS = ACCOBS + 100.
      EFF = (PROF/(OGAIN*ACCOBS))*100.
      EGAIN = GAIN
      GO TO 50
      END
C
C
      SUBROUTINE PRIOR

C     RESTORES ORIGINAL PRIOR AND GENERATES A SAMPLE PROCESS

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)

      K = NS+2
      MAX = NA(K-1)
      DO 1 I=1,MAX
      DO 1 J=1,K
    1 P(I,J) = OPR(I,J)
      CALL GEN
      RETURN
      END
      SUBROUTINE GEN

C     GENERATES A SAMPLE PROCESS

C     SPECIFICATIONS

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)

      N = NA(NS+1)
      DO 22 I=1,N
      SUM = 0.
      DO 20 J=1,NS
      M = P(I,J)*P(I,NS+1)
      RAN = 0.
      DO 21 K=1,M
      AA = RANNOF(X)
   21 RAN = RAN - .05*LOGF(AA)
      AP(I,J) = RAN
   20 SUM = SUM + AP(I,J)
      DO 22 J=1,NS
   22 AP(I,J) = AP(I,J)/SUM
      DUMMY = VALUE(5)
      RETURN
      END
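GEN builds each row of the sample process from sums of exponential variates: for each next state J it adds M = P(I,J)*P(I,NS+1) terms of the form -log(uniform), then divides the row by its total. Normalised sums of this kind give a draw from the multi-dimensional beta prior with integer parameters, and the constant .05 cancels in the normalisation. The Python sketch below shows the construction for one row. The names are hypothetical, and the handling of components with M = 0 (fixed at exactly zero here) is an assumption of the sketch rather than a statement about the original program.

```python
import math
import random


def sample_row(p_row, n_prior, rng=None):
    """Draw one row of transition probabilities from the prior.

    p_row   : prior mean probabilities for the row
    n_prior : prior strength N (equivalent number of observations)
    rng     : object with a random() method; defaults to the random module
    """
    if rng is None:
        rng = random
    gammas = []
    for pj in p_row:
        m = int(pj * n_prior)  # truncation, as in M = P(I,J)*P(I,NS+1)
        g = 0.0
        for _ in range(m):
            # 1 - random() lies in (0, 1], so the logarithm is defined.
            g -= math.log(1.0 - rng.random())
        gammas.append(g)
    total = sum(gammas)
    if total == 0.0:
        raise ValueError("every component has zero prior weight")
    return [g / total for g in gammas]
```

Larger values of `n_prior` concentrate the sampled rows more tightly around `p_row`.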
C     MAIN 1  5/5/64

C     FOLLOW OPTIMAL POLICY OF EXPECTED PROCESS
C
      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)

      READ 502, IPRES,NOBS,NOITS,IRUN
      CALL IPUT
  502 FORMAT (I2,3X,I3,3X,I3,3X,I2)
      CALL OPUT(0,0)

C     CHOOSE AN INITIAL POLICY USING ESTIMATES

      IDUM = ITER(6)

C     WASTE 25 RANDOM NUMBERS

      AA = SETUF(IRUN)
      DO 7 I=1,25
    7 AA = RANNOF(X)
    5 CALL PRIOR
      CALL OPUT(-1,0)

C     SIMULATE NOITS ITERATIONS

      DO 2 I=1,NOITS
      GAIN = VALUE(4)
      CALL OBS(NOBS)
      CALL OPUT(I,NOBS)
    2 CONTINUE
      GO TO 5
      END
C
C     TRADE-OFF BETWEEN GAIN AND INFORMATION

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      DIMENSION HOLD(150)

      READ 501,IPRES,IRUN,NOITS,ALPHA,BETA
  501 FORMAT (I2,3X,I2,3X,I2,3X,F5.2,3X,F5.2)
      CALL IPUT

C     WASTE 25 RANDOM NUMBERS

      AA = SETUF(IRUN)
      DO 7 I=1,25
    7 AA = RANNOF(X)
      NS = NS
      MAX = NA(NS+1)
      CALL OPUT(0,0)
    6 CALL PRIOR
      CALL OPUT(-1,0)
      DO 1 KKK=1,NOITS

C     PUT Q'S IN HOLDING AREA AND RELATIVIZE

      QMIN = 999999.
      DO 2 I=1,MAX
      TEMP = P(I,NS+2)
    2 QMIN = MIN1F(QMIN,TEMP)
      DO 3 I=1,MAX
      HOLD(I) = P(I,NS+2)-QMIN
    3 P(I,NS+2) = (HOLD(I)**ALPHA)/P(I,NS+1)

C     COMPUTE POLICY BY MAXIMIZING W

      IDUM = ITER(6)

C     FIND MINIMUM W

      WMIN = 999999.
      DO 4 I=1,NS
      IDUM = K(I)+NA(I)
      TEMP = P(IDUM,NS+2)
    4 WMIN = MIN1F(WMIN,TEMP)

C     RESTORE Q'S

      DO 5 I=1,MAX
    5 P(I,NS+2) = HOLD(I)+QMIN

C     OBSERVE PROPORTIONAL TO WMIN

      NOBS = (BETA*WMIN) + 1.
      GAIN = VALUE(3)
      CALL OBS(NOBS)
      CALL OPUT(KKK,NOBS)
    1 CONTINUE
      GO TO 6
      END
C     MAIN 3

C     WEIGHTED REWARDS TECHNIQUE
C     C = (ALPHA*Q)/N + 1

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      DIMENSION C(150)

      READ 501,IRUN,IPRES,NOITS,ALPHA,BETA
  501 FORMAT (3(I2,3X),F5.0,3X,F5.0)
      CALL IPUT
      CALL OPUT(0,0)

C     WASTE 25 RANDOM NUMBERS

      AA = SETUF(IRUN)
      DO 2 I=1,25
    2 AA = RANNOF(X)
      IDUM = ITER(6)
      NS = NS
      MAX = NA(NS+1)
   99 CALL PRIOR
      CALL OPUT(-1,0)
      DO 3 I=1,NOITS
      DO 4 J=1,MAX
      C(J) = (ALPHA*ABSF(P(J,NS+2)))/P(J,NS+1) + 1.0
    4 P(J,NS+2) = P(J,NS+2)*C(J)
      GAIN = VALUE(4)
      FNOBS = 0.
      DO 7 J=1,MAX
    7 P(J,NS+2) = P(J,NS+2)/C(J)

C     COMPUTE EST. GAIN FOR POLICY USING NON-DUMMY Q'S

      GAIN = VALUE(3)
      DO 5 J=1,NS
      IMIN = NA(J)+1
      IMAX = NA(J+1)
      CMIN = 999999.
      DO 6 L=IMIN,IMAX
      IF (CMIN-C(L)) 6,6,8
    8 CMIN = C(L)
      ICMIN = L
    6 CONTINUE
      IDUM = NA(J) + K(J)
      IF (IDUM-ICMIN) 10,11,10
   11 FNOBS = FNOBS + BETA*P(IDUM,NS+1)
      GO TO 5
   10 FNOBS=(P(ICMIN,NS+1)*ABSF(P(IDUM,NS+2)))/ABSF(P(ICMIN,NS+2))-
     1P(IDUM,NS+1) + FNOBS
    5 CONTINUE
      NOBS = FNOBS
      CALL OBS(NOBS)
      CALL OPUT(I,NOBS)
    3 CONTINUE
      GO TO 99
      END
C     MAIN 4  WEIGHTED VARIANCE TECHNIQUE

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      DIMENSION Q(20)

      READ 501, IRUN,IPRES,NOITS,CEE
  501 FORMAT (I3,3X,I3,3X,I3,3X,F10.5)

C     WASTE SOME RANDOM NUMBERS

      AA = SETUF(IRUN)
      DO 2 I=1,25
    2 AA = RANNOF(X)
      CALL IPUT
   14 CALL PRIOR
      CALL OPUT(0,0)
      DO 3 JJJ=1,NOITS
      NOPTS = 0
      PROFIT = 0.0
      DO 4 KKK=1,100
      EGAIN = VALUE(4)
      KOPT = K(IPRES)
      MIN = NA(IPRES)+1
      MAX = NA(IPRES+1)
      VAR = 0.0
      NALT = 0
      DO 5 I=MIN,MAX
      TEST = 0.0
      DO 6 J=1,NS
    6 TEST = TEST + R(I,J)*P(I,J)*(1.-P(I,J))
      TEST = (TEST/(((P(I,NS+1))*(P(I,NS+1)+1.))**2))*CEE
      IF (TEST-VAR) 5,5,7
    7 VAR = TEST
      NALT = I
    5 CONTINUE
      IF (NA(IPRES)+K(IPRES)-NALT) 8,13,8
    8 DO 10 I=MIN,MAX
      J = I-MIN+1
      Q(J) = P(I,NS+2)
   10 P(I,NS+2) = -999999.
      J = NALT - MIN + 1
      P(NALT,NS+2) = Q(J)
      TGAIN = VALUE(4)
      DO 11 I=MIN,MAX
      J = I-MIN+1
   11 P(I,NS+2) = Q(J)
      IF (EGAIN - TGAIN - VAR) 12,12,13
   13 K(IPRES) = KOPT
      NOPTS = NOPTS + 1
   12 IOLD = NA(IPRES) + K(IPRES)
      CALL OBS(1)
      IPRES = IPRES
      PROFIT = PROFIT + R(IOLD,IPRES)
    4 CONTINUE
      GAIN = VALUE(4)
      WORK2(1) = PROFIT
      CALL OPUT(JJJ,NOPTS)
    3 CONTINUE
      GO TO 14
      END
C     MAIN5  PLOT DISTR. OF GAIN
C     SPECIFICATIONS

      DIMENSION NA(16),K(15),AP(150,16),R(150,16),P(150,17),W(15,16)
      DIMENSION V(15),WORK2(15),A(15,15)
      DIMENSION OPR(150,17)
      COMMON OPR
      COMMON NA,K,AP,R,P,W,V,NS,IPRES,GAIN,IRUN
      EQUIVALENCE (A,W),(W(226),WORK2)
      DIMENSION G(51)

      READ 501,IRUN,MAX,MTIM
  501 FORMAT (I3,3X,I4,3X,I4)

C     WASTE SOME RANDOM NUMBERS

      AA = SETUF(IRUN)
      DO 12 I=1,25
   12 AA = RANNOF(X)
   14 CALL IPUT

C     COMPUTE RANGE OF GAIN

      PGAIN = VALUE(4)
      PRINT 503,PGAIN
      RMIN = PGAIN*.85
      RMAX = PGAIN*1.15
      RINC = (RMAX-RMIN)/50.

C     INITIALIZE

      GSUM = 0.
      GSQ = 0.
      DO 3 I=1,51
    3 G(I) = 0.

C     BEGIN SIMULATION LOOP

      DO 6 KKK=1,MAX
      CALL GEN
      GAIN = VALUE(2)
      GSUM = GSUM + GAIN
      GSQ = GSQ + GAIN*GAIN
      GAIN = GAIN - RINC
      DO 7 JK=1,50
      IF (GAIN-RMIN) 5,5,4
    5 G(JK) = G(JK) + 1.
      GO TO 9
    4 GAIN = GAIN - RINC
    7 CONTINUE
      G(51) = G(51) + 1.
    9 FMAX = KKK
      CALL TIMLFT(JTIM)
      IF (JTIM-MTIM) 10,10,6
    6 CONTINUE
   10 GMEAN = GSUM/FMAX
      GVAR = GSQ/FMAX - GMEAN*GMEAN
      PRINT 502,GMEAN,GVAR,PGAIN,FMAX
  502 FORMAT (1H1,4HGAIN,2X,4HPROB,2X,6HMEAN =,F10.4,2X,5HVAR =,F10.4
     1,2X,10HEXP GAIN =,F10.4,2X,10HNO SAMPS =,F6.0)
      DO 8 I=1,51
      G(I) = G(I)/FMAX
      RMIN = RMIN + RINC
      PRINT 503,RMIN,G(I)
  503 FORMAT (1X,F10.2,2X,F10.5)
    8 CONTINUE
      GO TO 14
      END
C     NUMERICAL SOLUTION TO SPECIAL TWO STATE PROBLEM

      DIMENSION P(2,2), R(3,2), V(1000), IBDY(1000)

      DO 21 II=1,6
    2 READ 101, BETA,NMAX
      READ 102, P(1,1), P(1,2), P(2,1), P(2,2)
      READ 102, R(1,1), R(1,2), R(2,1), R(2,2), R(3,1), R(3,2)
  101 FORMAT (F10.0,I5)
  102 FORMAT (6F10.0)

      A1 = R(3,1) + BETA*(P(1,1)*R(1,1) + P(1,2)*R(1,2))/(1.0 -
     1 P(1,1)*BETA)
      A2 = (P(1,2)*BETA*BETA)/(1.0 - P(1,1)*BETA)
      A3 = R(3,2)
      A4 = BETA
      C = (A4*P(2,1)*(P(1,1)*R(1,1) + P(1,2)*R(1,2)) + (1.0 - A4*P(1,1))
     1 *(P(2,1)*R(2,1) + P(2,2)*R(2,2)))/((1.0 - A4*P(1,1))*(1.0 - A4*
     2 P(2,2)) - P(1,2)*P(2,1)*A4*A4)

      DO 5 I=1,NMAX
      FN=NMAX
      S=I
      V(I) = (S*A1/FN + (1.0 - S/FN)*A3)/(1.0 - S*A2/FN - (1.0 - S/FN)*
     1 A4)
      IF (V(I)-C) 6,6,5
    6 V(I)=C
      MAXR=I
      IBDY(NMAX)=MAXR
      GO TO 7
    5 CONTINUE
      GO TO 3
    7 DO 4 I=MAXR,NMAX
    4 V(I)=C
    3 NM=NMAX-1
   20 FORMAT (I5,I5/(10F10.4))

      DO 12 NN=1,NM
      N=NMAX-NN
      FN=N
      DO 8 J=1,N
      S=J
    9 V(J) = (S*(A1+A2*V(J+1))/FN) + (1.-S/FN)*(A3+A4*V(J))
      IF (V(J)-C) 10,10,8
   10 V(J) = C
      IBDY(N)=J
      GO TO 11
    8 CONTINUE
   11 IBDN=IBDY(N)
      IF (N-51) 300,300,301
  300 PRINT 20, N,IBDN,(V(I),I=1,IBDN)
  301 CONTINUE
   12 CONTINUE
      PRINT 105, (IBDY(J),J=1,NMAX)
  105 FORMAT (1X,20I3,2X/)
   21 CONTINUE
      CALL EXIT
      END


A.17 Flow Charts


MAIN 1

 1. READ PROCESS (IPUT)
 2. READ IPRES, NOBS, NOITS, IRUN
 3. PRINT PROCESS (OPUT(0,0))
 4. CHOOSE AN INITIAL POLICY TO START CALCULATIONS (IDUM=ITER(6))
 5. WASTE 25 RANDOM NUMBERS
 6. GENERATE A SAMPLE PROCESS (GEN)
 7. FIND OPTIMAL POLICY OF EXPECTED PROCESS (GAIN=VALUE(4))
 8. SIMULATE NOBS OBSERVATIONS UNDER THAT POLICY (OBS(NOBS))
 9. PRINT GAIN, EFFICIENCY, POLICY (OPUT(I,N))
10. I=I+1;  I=NOITS?
MAIN 2


MAIN 3

 1. READ IPRES, IRUN, NOITS, ALPHA, BETA
 2. READ PROCESS
 3. PRINT FINAL MATRICES (OPUT(0,0))
 4. WASTE 25 RANDOM NUMBERS
 5. CHOOSE AN INITIAL POLICY (ITER 6)
 6. FIND OPTIMUM WEIGHTED POLICY (VALUE(4))
 7. ACCUMULATE FNOBS OVER THE STATES:
      FNOBS = FNOBS + N_CMIN |q_k| / |q_CMIN| - N_k
    or
      FNOBS = FNOBS + BETA N_k
 8. SIMULATE FNOBS OBSERVATIONS (OBS(NOBS))
DISTRIBUTION OF GAIN

 1. READ IRUN, MAX, MTIM
 2. WASTE 25 RANDOM NUMBERS
 3. READ PROCESS (IPUT)
 4. RINC = (RMAX-RMIN)/50
 5. GSUM=0, GSQ=0
 6. G(I)=0, I=1,51
 7. KKK=1 (iteration count)
 8. GENERATE A SAMPLE PROCESS (GEN)
 9. FIND GAIN OF SAMPLE PROCESS
10. GSUM=GSUM+GAIN, GSQ=GSQ+GAIN*GAIN
11. INCREMENT PROPER CELL BY 1
12. GMEAN = GSUM/FMAX
13. GVAR = GSQ/FMAX - (GMEAN)**2
14. CONVERT COUNTS TO FREQUENCIES: G(I)=G(I)/FMAX
15. PRINT RESULTS
APPENDIX B


SUMMARY OF COMPUTER RUNS


B.1 Runs with MAIN I

B.1.1 Extreme point from sample process (Run 1)

We chose as a prior the following matrix, $\{p_{ij}^k\}$, with its
corresponding set of prior parameters $N_i^k$:

                    $\{p_{ij}^k\}$
  i    k      j = 1    j = 2    j = 3     $N_i^k$

  1    1       1/3      1/3      1/3        30
  1    2       1/10     8/10     1/10       20
  1    3       1/10     1/10     8/10       60
  2    1       2/10     0        8/10       30
  2    2       8/10     1/10     1/10       20
  3    1       1/3      1/3      1/3        30
  3    2       1/4      1/2      1/4        15
  3    3       8/10     1/10     1/10       40

As the actual process we used the taxicab example presented in
the introduction. The actual process will be seen to be an "extreme
point" from the prior. As expected (see 6.2), only policy (1,1,1),
with an actual gain of 9.24, is ever chosen. The true optimum (2,2,2),
with gain of 13.34, is never discovered.
B.1.2 Randomly chosen processes (Run 10)

This run was for a five-state process with three alternatives
in each state. Both the prior probabilities and the rewards were drawn
from random number tables. Each $N_i$ was chosen to be 25, which we
consider expresses a small degree of "certainty." The results are
summarized in the table following. Notice that the optimal solution
(or one very close to it) was always found quite rapidly, and that
efficiency stays in the high 90's.

B.2 Runs with MAIN II

B.2.1 Run 2

The same data was used as in Run 1, but MAIN II was used
to try to force (2,2,2) to be explored. Our parameters were α = 20,
β = 100. In 25 iterations, only policy (1,1,1) was chosen.

B.2.2 Run 3

We used the same data again, but modified our parameters to
α = 2, β = 100. Some experimentation was introduced: (1,1,1) was used
for the first 19 iterations, then (1,2,1), with gain of 12.50, was tried
for 6 iterations, then (1,2,2), with gain of 13.15, was tried once. But
then the process locked on (1,1,1).
RUN 10, MAIN 1: 50 ITERATIONS OF 20 OBSERVATIONS

Sample  Optimal     Optimal  Max.     Iteration  Min.    Iteration  No.    Diff. policies
Point   Policy      Gain     Eff.     #          Eff.    #          Opts.  tried

  1     1,3,1,2,1    5.68     .9968    50         .9677    1         45       2
  2     1,3,1,2,1    5.82     .9990    50         .9524    1         49       2
  3     1,1,1,2,1    5.61    1.0000     1         .9886    2         38       3
  4     1,3,1,2,3    5.83     .9873    50         .9783    1          0       2
  5     1,1,1,2,1    6.20    1.000     50        1.000     1         50       1
B.3 Runs with MAIN III


B.3.1 Run 4

Same data as Run 1, with parameter α = 20.

Iteration   NOBS   Policy   Est. gain   Act. gain

    1         54     221       8.69        8.81
    2        111     111       9.38        9.20
    3         34     222       5.70       13.34
    4         18     222       9.26       13.34
    5          8     222      10.57       13.34
    6          8     222      11.76       13.34
    7          5     222      10.90       13.34
    8          5     222      11.12       13.34
    9          4     222      11.14       13.34
   10          4     222      11.29       13.34
   11         40     212       8.95        8.81

Locks on 222.
B.3.2 Run 5: MAIN III, same data as Run 1, β = 0.4, α = 20

Iteration   NOBS   Policy   Est. gain   Act. gain

    1         54     212       8.69        8.81
    2        133     111       9.38        9.20
    3         40     222       5.70       13.34
    4         19     222       9.46       13.34
    5         37     111      10.33       13.34
    6         62     212       8.99        8.81
    7         85     222      11.07         -
    8        138     212       9.09         -
    9        287     122      12.05       13.15
   10        863     111       9.29         -
   11        233     222      12.85         -
   12        287     222      13.04         -
   13        708     212       8.99         -
   14        620     222      13.24         -
   15       1418     122      13.04       13.15
   16       2004     323       8.93        8.98
   17       4406     111       9.24         -
   18       Locks on 222 (no. of observations at each iteration grows
            larger and larger)
B.3.3 Run 6

Same data as Run 5, but parameters α = 20, β = 0.1. This run
uses (2,2,2) most of the time, but converges slowly to it. It begins
with (2,1,2), then (1,1,1), and later returns to (2,1,2) for a few
iterations.

B.3.4 Run 7

Same data as Run 6, with α = 20, β = .01. Same general shape
as Run 6, but there are more observations in later stages.

B.3.5 Run 8

More weight is given to immediate returns by making α = 40, β = .1.
See graph following for plot of efficiency (see Chapter VI) versus
accumulated number of observations.

B.3.6 Run 9

Same data, α = 10, β = 1. See graph following.
[Graphs: efficiency versus accumulated number of observations (not
including prior); horizontal axis marked at 1,000 and 10,000.]


APPENDIX  C 


FINDING  A  SECOND  BEST  POLICY 


C.1 Motivation


In Chapter VII we discussed the advantages of being able to
select the ten best policies for the expected process, in order to
simulate and find more explicit information about these most likely
candidates. But finding the ten best policies hinges upon finding an
algorithm to find the second best policy.


C.2 Impossibility of exact solution

Let O denote the optimal policy of a known process (such as
the expected process), and let N denote the next, or second, best
policy. Then:

$$\Delta g = g^O - g^N$$

is to be a minimum for $N \neq O$.

Define:

$$\gamma_i^N = \Big[ q_i^O + \sum_j p_{ij}^O v_j^O \Big] - \Big[ q_i^N + \sum_j p_{ij}^N v_j^O \Big]$$

We know that:

$$g^O + v_i^O = q_i^O + \sum_j p_{ij}^O v_j^O$$

$$g^N + v_i^N = q_i^N + \sum_j p_{ij}^N v_j^N$$

Thus:

$$g^O - g^N + v_i^O - v_i^N = \gamma_i^N + \sum_j p_{ij}^N v_j^O - \sum_j p_{ij}^N v_j^N$$

Multiplying both sides by $\pi_i^N$ and summing over i, we have:

$$\sum_i (g^O - g^N)\,\pi_i^N + \sum_i v_i^O \pi_i^N - \sum_i v_i^N \pi_i^N = \sum_i \gamma_i^N \pi_i^N + \sum_i \sum_j p_{ij}^N \pi_i^N v_j^O - \sum_i \sum_j p_{ij}^N \pi_i^N v_j^N$$

But we recall that $\pi^N P^N = \pi^N$, whence,

$$\sum_i \sum_j p_{ij}^N \pi_i^N v_j^O = \sum_j v_j^O \pi_j^N$$

$$\sum_i \sum_j p_{ij}^N \pi_i^N v_j^N = \sum_j v_j^N \pi_j^N$$

So finally,

$$g^O - g^N = \Delta g = \sum_i \gamma_i^N \pi_i^N$$

Thus, the second best policy is that which minimizes:

$$\sum_i \gamma_i^N \pi_i^N, \qquad N \neq O$$

But minimizing this quantity involves a search through all possible
sets of $\pi$, and this is prohibitive.
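The identity $g^O - g^N = \sum_i \gamma_i^N \pi_i^N$ can be checked numerically on a small case. The Python sketch below does so for a hypothetical two-state process with made-up numbers (they are not taken from this report). The helper names, the convention that the last relative value is set to zero, and the small Gauss-Jordan solver are all assumptions of the sketch.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate(p, q):
    """Gain g and relative values v (last one fixed at 0) of a policy."""
    n = len(q)
    # Unknowns: v[0..n-2] and g, from g + v_i - sum_j p_ij v_j = q_i.
    a = [[(1.0 if i == j else 0.0) - p[i][j] for j in range(n - 1)] + [1.0]
         for i in range(n)]
    x = solve(a, q)
    return x[-1], x[:-1] + [0.0]


def stationary(p):
    """Stationary probabilities pi, with pi P = pi and sum(pi) = 1."""
    n = len(p)
    a = [[p[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n - 1)] + [[1.0] * n]
    return solve(a, [0.0] * (n - 1) + [1.0])


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# Hypothetical data: policy O (optimal) and policy N (the competitor).
p_o, q_o = [[0.8, 0.2], [0.7, 0.3]], [4.0, -5.0]
p_n, q_n = [[0.5, 0.5], [0.4, 0.6]], [6.0, -3.0]

g_o, v_o = evaluate(p_o, q_o)
g_n, _ = evaluate(p_n, q_n)
pi_n = stationary(p_n)
gamma = [(q_o[i] + dot(p_o[i], v_o)) - (q_n[i] + dot(p_n[i], v_o))
         for i in range(2)]
weighted = dot(gamma, pi_n)
```

For these numbers `g_o - g_n` and `weighted` agree to rounding error. Note that `pi_n` depends on the competing policy N itself, which is why the minimization cannot be carried out state by state.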
C.3 Approximations to second best policy

We can get an "approximation" by changing only one state from
the optimal policy, so that

$$\gamma_i^N = \begin{cases} 0, & i \neq k \\ \text{min possible for that state} > 0, & i = k \end{cases}$$

This procedure serves to minimize:

$$\sum_i \gamma_i^N$$

but ignores the weighting by the $\pi$'s.


C.4 Justification of approximation

This approximation procedure may be used as follows:

1. Find optimum policy.

2. Perturb each state: i.e., find N policies, each differing from the
   optimal in one state. Make the best such perturbation possible in
   each state (that is, minimize $\gamma_i$).

3. Apply mean-variance algorithm to these N policies.


The obvious disadvantage of this approach is that the N policies so
determined are not actually the N best. But a compensating advantage
is that at least 2 alternatives in each state are always considered, which
encourages wide consideration of experimentation. Also, since the N
best policies of the expected process are not necessarily the N best of
the actual process, it is not so critical to find the precise N best of
the expected process, so long as the ones chosen embrace a wide
variety of alternatives, and are reasonable candidates for inspection.
This approximation technique meets these demands.
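Step 2 of the procedure can be sketched in Python as follows. The sketch assumes that the optimal policy and its relative values are already available, uses 0-based indices and nested lists, and skips states with only one alternative. All names are hypothetical, and the mean-variance algorithm of step 3 is not shown.

```python
def perturbed_policies(q, p, v_opt, policy):
    """One candidate policy per state, each differing from the optimum
    in exactly one state and chosen to make gamma_i as small as possible.

    q[i][k], p[i][k][j] : rewards and transition probabilities
    v_opt               : relative values of the optimal policy
    policy              : optimal alternative in each state (0-based)
    Returns a list of (candidate_policy, gamma) pairs.
    """
    def test(i, k):
        # Test quantity q_i^k + sum_j p_ij^k v_j, using the optimal values.
        return q[i][k] + sum(pij * vj for pij, vj in zip(p[i][k], v_opt))

    candidates = []
    for i in range(len(q)):
        others = [k for k in range(len(q[i])) if k != policy[i]]
        if not others:
            continue  # nothing to perturb in this state
        # The smallest loss in the test quantity is the smallest gamma_i.
        best = min(others, key=lambda k: test(i, policy[i]) - test(i, k))
        gamma = test(i, policy[i]) - test(i, best)
        candidate = list(policy)
        candidate[i] = best
        candidates.append((candidate, gamma))
    return candidates
```

Each candidate keeps the optimal alternative everywhere except in one state, so the list always contains at least two alternatives for every state that offers a choice.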